How to count years in a text in R?

Question

I want to count the years that appear between an opening and a closing parenthesis in the following text, named txt.

library(stringr)
txt <- "Text Mining exercise (2020) Mining, p. 628508; Computer Science text analysis (1998) Computer Science, p.345-355; Introduction to data mining (2015) J. Data Science, pp. 31-33"

lengths(strsplit(txt, "\\(\\d{4}\\)")) gives me 4, which is wrong. Any help?

Answer 1

You can use str_extract_all with a regular expression that combines a positive lookbehind and a positive lookahead.

stringr::str_extract_all(txt, '(?<=\\()\\d+(?=\\))')[[1]]
#[1] "2020" "1998" "2015"

If you want to count how many there are, use length.

length(stringr::str_extract_all(txt, '(?<=\\()\\d+(?=\\))')[[1]])
#[1] 3

Using str_match_all with a capture group may be easier:

stringr::str_match_all(txt, '\\((\\d+)\\)')[[1]][, 2]
#[1] "2020" "1998" "2015"

Answer 2

If you prefer base R:

regmatches(txt, gregexpr("[^0-9]\\d{4}[^0-9]", txt))

gives

[[1]]
[1] "(2020)" "(1998)" "(2015)"

If we wrap that in lengths( ... ), we get the correct answer.

Edit: Or, if you really only want the count, we can shorten this to

lengths(gregexpr("[^0-9]\\d{4}[^0-9]", txt))
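
One caveat with the shortened form, shown here as a minimal sketch: gregexpr() returns -1 when nothing matches, so lengths() reports 1 rather than 0 for a text without any years. Going through regmatches() avoids that. The names txt_none and count_matches are hypothetical and used only for illustration.

```r
# Hypothetical input that contains no four-digit number at all
txt_none <- "No years in this text"

m <- gregexpr("[^0-9]\\d{4}[^0-9]", txt_none)

# gregexpr() signals "no match" with a single -1, so this is 1, not 0
n_short <- lengths(m)

# regmatches() drops the -1 and returns character(0), so this is 0
n_safe <- lengths(regmatches(txt_none, m))

# Hypothetical helper that counts matches and returns 0 when there are none
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

count_matches("Data mining (2015) J. Data Science", "[^0-9]\\d{4}[^0-9]")
```

The same pattern also requires a non-digit on both sides, so a year at the very start or end of the string would not be counted.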

Answer 3

I think you are looking for stringr::str_count():

str_count(txt, "\\([0-9]{4}\\)")
[1] 3

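To count only four-digit numbers in parentheses that start with 1 or 2, followed by 0 or 9:

str_count(txt, "\\([1-2][09][0-9]{2}\\)")

Strictly starting with 19 or 20:

str_count(txt, "\\(19[0-9]{2}\\)|\\(20[0-9]{2}\\)")
# In R 4.0 or later, using a raw string
str_count(txt, r"(\(19[0-9]{2}\)|\(20[0-9]{2}\))")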
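
As a small sketch of why the two forms above are interchangeable: a raw string (R 4.0 or later) holds the same characters as the escaped string, just without doubled backslashes. The variable names below are hypothetical, and the count is done in base R so the snippet does not depend on stringr.

```r
txt <- "Text Mining exercise (2020) Mining, p. 628508; Computer Science text analysis (1998) Computer Science, p.345-355; Introduction to data mining (2015) J. Data Science, pp. 31-33"

# Escaped and raw-string spellings of the same pattern
escaped <- "\\(19[0-9]{2}\\)|\\(20[0-9]{2}\\)"
raw     <- r"(\(19[0-9]{2}\)|\(20[0-9]{2}\))"
same    <- identical(escaped, raw)

# A shorter pattern with the same meaning, grouping the two prefixes
grouped <- r"(\((19|20)[0-9]{2}\))"
n_years <- lengths(regmatches(txt, gregexpr(grouped, txt)))
n_years
```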
