R string and subset

Question

I have a very long HTML string:

Length: 1
Class and mode: character

......uygdasd class="vip" title="Click this link to access The Big Bang Theory: The Complete Fourth Season (DVD, 2011, 3-Disc Set).....

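Is it possible to extract part of this string based on the text inside it? I want to drop everything up to and including class="vip" title="Click this link to access, and everything from (DVD, 2011 onward, so that the result is: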

The Big Bang Theory: The Complete Fourth Season

Thanks for your help.

Answer 1

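Use the grouping operator (). The pattern below discards everything up to and including "link to access " and everything from " (DVD," onward, keeping only the match of the second group. The expression .+ means <anything, of any length>. For an explanation of "^" and "$", and more detail on using \\N in the replacement, see the ?regex help page: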

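htxt <- 'uygdasd class="vip" title="Click this link to access The Big Bang Theory: The Complete Fourth Season (DVD, 2011, 3-Disc Set).....'

# Backslashes are doubled because "\" is the escape character in R string literals.
# Group 2 holds the title; "\\2" in the replacement keeps only that group.
gsub(pattern = "^(.+link to access )(.+)( \\(DVD,.+$)", replacement = "\\2", x = htxt)
[1] "The Big Bang Theory: The Complete Fourth Season"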
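As a variation on the same idea, here is a minimal sketch of two other ways to pull out the title, assuming the input always contains the literal markers "link to access " and " (DVD,". The names title1 and title2 are illustrative and not from the original answer. The first variant uses a single capture group with a lazy quantifier; the second uses lookaround with regexpr() and regmatches(), so only the title itself is matched.

```r
# Sample input copied from the question (the dots are part of the sample).
htxt <- 'uygdasd class="vip" title="Click this link to access The Big Bang Theory: The Complete Fourth Season (DVD, 2011, 3-Disc Set).....'

# Variant 1: sub() with one capture group.
# The lazy quantifier .+? stops at the first " (DVD," after the prefix.
title1 <- sub("^.*link to access (.+?) \\(DVD,.*$", "\\1", htxt, perl = TRUE)

# Variant 2: match only the title, using lookbehind and lookahead (PCRE).
m <- regexpr("(?<=link to access ).+?(?= \\(DVD,)", htxt, perl = TRUE)
title2 <- regmatches(htxt, m)

print(title1)
print(title2)
```

One difference worth knowing: when the pattern does not match, sub() returns the input unchanged, while regmatches() returns character(0), which makes a failed extraction easier to detect.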

Of course, there is also a famous, highly upvoted answer on this kind of problem: RegEx match open tags except XHTML self-contained tags
