Remove the string before a certain word with R
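I have a character vector that I need to clean. Specifically, I want to remove the number that comes before the word "Votes". Note that the number uses commas as thousands separators, so it is easier to treat it as a string.
I know that gsub("*. Votes","", text) will remove everything, but how do I remove only the number? Also, how do I collapse repeated spaces into a single space?
Thanks for your help!
Sample data: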
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
You can use
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", text))
# => [1] "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? Votes"
See the online R demo and the online regex demo.
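The sample string as shown contains only single spaces, so the whitespace-collapsing branch of the pattern never fires on it. As a minimal sketch, the same call can be tried on a hypothetical string (the `demo` value below is made up for illustration, not taken from the question) that has runs of spaces:

```r
# Hypothetical input with runs of spaces (not the original data)
demo <- "Shall  Chapter   202 be amended?    558,586   Votes  "

# First alternative: a run of 2+ whitespace characters is replaced by its
# last character (group 1). Second alternative: the number before "Votes"
# is dropped and only "Votes" (group 2) is put back.
cleaned <- trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", demo))
print(cleaned)
# [1] "Shall Chapter 202 be amended? Votes"
```

Note that "202" is left alone because it is not followed by "Votes".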
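Details
(\s){2,} - matches 2 or more whitespace characters while capturing the last occurrence, which is reinserted using the \1 placeholder in the replacement pattern
| - or
\d - a digit
[0-9,]* - 0 or more digits or commas
\s* - 0+ whitespace characters
(Votes) - Group 2 (restored in the output using the \2 placeholder): a Votes substring.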
Note that trimws removes any leading/trailing whitespace.
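If the single combined pattern is hard to read, the two tasks can also be done as separate steps. The following is only a sketch of that alternative, not the approach used in the answer above; it assumes `perl = TRUE` so that a lookahead can be used, and the `demo` string is a hypothetical input:

```r
# Hypothetical input with repeated spaces, for illustration only
demo <- "transferee?   558,586    Votes"

# Step 1: remove the number (digits and commas) plus any whitespace after it,
# but only when "Votes" follows; the lookahead leaves "Votes" in place.
no_number <- gsub("\\d[0-9,]*\\s*(?=Votes)", "", demo, perl = TRUE)

# Step 2: collapse every run of whitespace to one space, then trim the ends.
clean <- trimws(gsub("\\s+", " ", no_number))
print(clean)
# [1] "transferee? Votes"
```

Splitting the work this way costs one extra pass over the string but keeps each regular expression short.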
The simplest way is with stringr:
> library(stringr)
> regexp <- "-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+"
> str_extract(text,regexp)
[1] "558,586 Votes"
To do the same thing but extract only the number, wrap it in gsub:
> gsub('\\s+[[:alpha:]]+', '', str_extract(text,regexp))
[1] "558,586"
Here is a base R version that pulls out any number before the word "Votes", even if it contains commas or periods:
> gsub('\\s+[[:alpha:]]+', '', unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+", text))))
[1] "558,586"
If you want the label as well, just drop the gsub part:
> unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+", text)))
[1] "558,586 Votes"
If you want to extract all the numbers:
> unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]*", text)))
[1] "1" "15" "202" "558,586"
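The extract-then-strip steps above could probably be collapsed into a single match by using a lookahead, so that " Votes" is required but never becomes part of the result. This is a sketch under assumptions: the simplified pattern `[0-9][0-9,.]*` and the `demo` string are illustrative choices, not part of the original answer.

```r
# Hypothetical shortened input, for illustration only
demo <- "Shall Chapter 202 be amended? 558,586 Votes"

# Match a digit followed by digits, commas or periods, but only where
# " Votes" comes next; the lookahead needs perl = TRUE.
m <- regexpr("[0-9][0-9,.]*(?= Votes)", demo, perl = TRUE)
votes <- regmatches(demo, m)
print(votes)
# [1] "558,586"
```

Here `regexpr` returns only the first match; `gregexpr` would be the choice if several "Votes" counts can appear in one string.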