How can I remove all characters between two other recurring characters in a string using R?
Before getting to the gsub help I need: the following code successfully retrieves the text I need to clean.
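library(RCurl)  # provides getURL()
library(XML)    # provides htmlTreeParse(), xpathApply(), xmlValue()
# Download the page and parse it into an internal document tree.
am1 <- getURL("url.com")
ami1 <- htmlTreeParse(am1, useInternalNodes = TRUE)
# Extract the text of every <td> cell.
ami1.tree.parse <- unlist(xpathApply(ami1, path = '//td', fun = xmlValue))
# Join all cells except the first and the last into one string.
ami1.txt <- NULL
for (i in 2:(length(ami1.tree.parse) - 1)) {
  ami1.txt <- paste(ami1.txt, as.character(ami1.tree.parse[i]), sep = ' ')
}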
Problem
I can't remove all of the questions from the interview text. For example, the text looks like this:
[1] "Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively."
To be absolutely clear, what I need from the text above is:
[1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively."
I tried:
ami1.txt<-gsub("Q.[^?]+H:", "",ami1.txt)
ami1.txt<-gsub("Q.[^?]+H: ", "",ami1.txt)
ami1.txt<-gsub("Q.*H:", "",ami1.txt)
It comes down to my not having a good grasp of regex, but I'd appreciate it if anyone could point me in the right direction.
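A minimal sketch of why the three attempts above behave as they do, run on a shortened stand-in string (`x` is a hypothetical example, not the real interview text). It assumes the same "Q. ...?NAME: answer" layout as the text above.

```r
# Shortened stand-in for the interview text (same "Q. ...?NAME: answer" layout).
x <- "Q. How are you?JOE SMITH: Fine.Q. And now?JOE SMITH: Still fine."

# Attempts 1 and 2: [^?]+ cannot step over the "?" that ends every question,
# so the pattern never reaches "H:" and nothing is replaced.
a1 <- gsub("Q.[^?]+H:", "", x)
a2 <- gsub("Q.[^?]+H: ", "", x)

# Attempt 3: ".*" is greedy, so a single match runs from the first "Q" to the
# last "H:" and swallows every answer except the final one.
a3 <- gsub("Q.*H:", "", x)

print(identical(a1, x))  # TRUE: the string is unchanged
print(a3)                # " Still fine."
```

So the first two patterns match nothing, while the third matches too much.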
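Alas, I lied: the text is apparently a bit more complicated. I've appended the more complicated elements to the end of the text above, as shown below. Some of the "questions" (Q.) open with a statement: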
str2<-"Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively.Q. That's interesting. When would you consider speaking to her?JOE SMITH: Probably, tomorrow. Q. That sounds good. How do you feel now? Better than before?JOE SMITH: Yeah I'm feeling alright."
The task remains the same, and akrun's answer gets me close:
trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str2))
[1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively. Probably, tomorrow. Better than before? Yeah I'm feeling alright."
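Final update
akrun's answer:
trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str2))
I'm not entirely sure why the answer above doesn't remove everything between "Q" and the last question mark, but alas. After editing my question above, I realized that what I really want is for everything from "Q" to ":" to be removed. So I used this tool to work out what was wrong with my understanding of regex, and arrived at the following, which clears out all characters between "Q" and ":".
gsub("Q[^:]+\\?|[A-Z ]+:", "", str2)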
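The difference between the two patterns comes from where the negated character class is allowed to stop. The sketch below compares them on a shortened stand-in string (`x`, `first_q`, and `last_q` are hypothetical names chosen for illustration), where the question contains two question marks before the speaker label.

```r
# Shortened stand-in: the question contains two "?" before the speaker label.
x <- "Q. Good. How are you? Better?JOE SMITH: Yes."

# [^?]+ stops at the first "?", so the rest of the question survives.
first_q <- trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", x))

# [^:]+ runs up to the ":" and the match then ends at the last "?" before it,
# so the whole question is removed.
last_q <- trimws(gsub("Q[^:]+\\?|[A-Z ]+:", "", x))

print(first_q)  # "Better? Yes."
print(last_q)   # "Yes."
```

Note that the `[^:]` variant relies on the question itself containing no colon.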
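We can match a Q followed by one or more characters that are not a ? ([^?]+) and then a question mark, or (|) a run of uppercase letters and spaces followed by a : ([A-Z ]+:), and replace each match with an empty string (""). If leading or trailing spaces remain, remove them with trimws:
trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str1))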
#[1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively."
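To see what each side of the alternation contributes, the two alternatives can be applied one at a time. This is only an illustrative sketch on a shortened stand-in string; `x`, `no_questions`, `no_labels`, and `combined` are hypothetical names, not part of the answer.

```r
# Shortened stand-in with the same "Q. ...?NAME: answer" layout as str1.
x <- "Q. How are you?JOE SMITH: Fine.Q. And now?JOE SMITH: Still fine."

# Alternative 1 on its own: drop each question, from "Q" through its first "?".
no_questions <- gsub("Q[^?]+\\?", "", x)

# Alternative 2 on its own: drop the speaker label (capitals and spaces, then ":").
no_labels <- gsub("[A-Z ]+:", "", no_questions)

# Both alternatives in a single pass, as in the answer.
combined <- gsub("Q[^?]+\\?|[A-Z ]+:", "", x)

print(no_questions)      # "JOE SMITH: Fine.JOE SMITH: Still fine."
print(trimws(combined))  # "Fine. Still fine."
```

On this input the two-step result `no_labels` is identical to `combined`; the single pattern just avoids the second pass.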
Data
str1 <- "Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively."