Extracting numbers with decimals from large strings in R

I want to extract the numbers from this vector of 15 observations:

rs <- c("\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.0\n                    (1 rating)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            9 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.7\n                    (4 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            34 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    3.1\n                    (5 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            22 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    2.4\n                    (14 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            2,106 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.3\n                    (67 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            1,287 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (3 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            30 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n                \n                    \n\n    \n        New\n    \n\n\n                \n\n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    0.0\n                    (0 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            8 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n                \n                    \n\n    \n        Highest Rated\n    \n\n\n                \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (12 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            42 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.4\n                    (6 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            41 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.2\n                    (12 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            115 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.8\n                    (6 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            25 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (19 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            151 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.5\n                    (10 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            385 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.8\n                    (166 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            754 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    3.6\n                    (34 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            3,396 students enrolled\n        \n    \n\n\n    \n\n    "
)

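As you can see, the 15 elements are long and messy. However, the pattern inside them is easy to recognize. Each element contains 3 numbers (taking the first observation as an example): the rating (4.0), the number of ratings (1), and the number of students enrolled (9).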

I want to extract all of these values and create a data frame with 3 columns, one for each variable.

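I have been looking through several questions on Stack Overflow, mostly ones that use the stringr package and gsub(). However, I could not find the key to solving my problem.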

Update

This is the code I have tried:

library(stringr)

as.numeric(str_extract(rs, "[0-9]+"))
as.numeric(str_extract(rs, "[0-9]+")[[1]])
as.numeric(str_extract(rs, "(?<=\\()[0-9]+(?=\\))"))
as.numeric(sapply(strsplit(rs, " "), "[[", 1))

Using extract from tidyr, we can do:

library(dplyr)
library(tidyr)

data.frame(rs, stringsAsFactors = FALSE) %>%
  extract(rs, c("Rating", "Number_of_ratings", "Students_enrolled"),
          "(?s)(\\d\\.\\d).*?(\\d+)\\s*ratings?.*?(\\d+(?:,\\d+)?)\\s*students enrolled", 
          convert = TRUE) %>%
  mutate(Students_enrolled = as.numeric(sub(",", "", Students_enrolled)))

Output:

   Rating Number_of_ratings Students_enrolled
1     4.0                 1                 9
2     4.7                 4                34
3     3.1                 5                22
4     2.4                14              2106
5     4.3                67              1287
6     4.6                 3                30
7     0.0                 0                 8
8     4.6                12                42
9     4.4                 6                41
10    4.2                12               115
11    4.8                 6                25
12    4.6                19               151
13    4.5                10               385
14    4.8               166               754
15    3.6                34              3396

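Notes:

The regex looks complicated, but it really isn't. What extract does is pull the match out of each capture group (the parts enclosed in parentheses) and turn each one into its own column. (Inside the R string literal, every backslash shown below is written twice.)

  1. (?s) is the modifier that turns on "DOTALL" mode. This lets the dot . match newline characters as well.

  2. (\d\.\d) matches the Rating pattern.

  3. (\d+)\s*ratings? matches the Number_of_ratings pattern but extracts only the digits (\d+).

  4. (\d+(?:,\d+)?)\s*students enrolled matches the Students_enrolled pattern but extracts only the "digits with or without a comma" part.

  5. convert = TRUE tries to convert the resulting columns to their most suitable data type, but because Students_enrolled contains commas, an extra mutate is needed to convert it to numeric.

Normally, extract throws an error if the number of capture groups does not equal the number of output columns. Since the modifier (?s) and the non-capturing group (?:...) are not counted as capture groups, the capture group count here matches the column count.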
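To see the capture-group mechanics in isolation, here is a minimal base R sketch on a shortened, made-up string. The names x, pattern, m, groups and values are illustrative only, and regexec()/regmatches() are used merely to show which pieces the groups capture; this is not a claim about how extract is implemented internally.

```r
# A shortened, made-up sample shaped like one element of rs
x <- "\n   4.7\n   (4 ratings)\n   some text\n   2,106 students enrolled\n"

# Same pattern as above; backslashes are doubled inside an R string
pattern <- "(?s)(\\d\\.\\d).*?(\\d+)\\s*ratings?.*?(\\d+(?:,\\d+)?)\\s*students enrolled"

# regexec() reports the full match followed by one entry per capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

# Drop the full match (element 1); the three capture groups remain
groups <- m[-1]
print(groups)

# Only the last group can contain a comma, so strip it before converting
values <- as.numeric(gsub(",", "", groups, fixed = TRUE))
print(values)
```

The non-capturing group (?:,\d+)? contributes nothing of its own to groups, which is why exactly three values come back.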

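So your problem is that it does not see the "." as part of the number, because the number is inside a string. You therefore need to look explicitly for the digits and the decimal point.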

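library(stringr)

# One digit, a literal dot, one digit (e.g. "4.7")
Rating <- as.numeric(str_extract(rs, "[0-9]\\.[0-9]"))
# An opening parenthesis followed by one or more digits; the "+" keeps counts such as 14 whole
NRatings <- str_extract(rs, "\\([0-9]+") %>% str_replace("\\(", "") %>% as.numeric()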

I'll let you figure out the last one based on these examples ;)
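For readers who want to check their attempt, one possible way to get the remaining column is sketched below. It assumes stringr is installed; sample_rs, enrolled_chr and Enrolled are made-up names, and the two strings are shortened stand-ins for the elements of rs.

```r
library(stringr)

# Two made-up strings shaped like the elements of rs
sample_rs <- c("\n 4.0\n (1 rating)\n text\n 9 students enrolled\n",
               "\n 2.4\n (14 ratings)\n text\n 2,106 students enrolled\n")

# Digits and commas that are directly followed by " students enrolled"
enrolled_chr <- str_extract(sample_rs, "[0-9,]+(?= students enrolled)")

# Remove the thousands separator, then convert to numeric
Enrolled <- as.numeric(str_replace_all(enrolled_chr, ",", ""))
print(Enrolled)
```

The lookahead (?= students enrolled) anchors the match to the enrollment count, so the rating and the number of ratings earlier in the string are skipped.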

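A base R solution with a single dependency (stringi) and a commented, readable regex.

This also shows how to clean up the text for processing (in a way you can reuse).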

library(stringi)

do.call(
  rbind.data.frame,
  lapply(
    stri_match_all_regex(
      stri_replace_all_regex(
        stri_trim_both(rs),             # clean up outer spaces
        "[[:blank:][:space:]]+", " "    # clean up inner spaces
      ),
      "
([[:digit:]\\.]+)[[:space:]]+\\(([[:digit:],]+)[[:space:]]+rating[s]*\\)   # pick up the rating and total number of ratings
[^[:digit:]]*([[:digit:],]+)[[:space:]]+student[s]*[[:space:]]+enrolled    # pick up the number of students enrolled
",
      opts_regex = stri_opts_regex(comments = TRUE)
    ),
    function(x) {
      as.list(
        setNames(
          x[2:4], c("rating", "n_ratings", "enrolled")
        ),
        stringsAsFactors = FALSE
      )
    }
  )
)

This results in:

##    rating n_ratings enrolled
## 2     4.0         1        9
## 21    4.7         4       34
## 3     3.1         5       22
## 4     2.4        14    2,106
## 5     4.3        67    1,287
## 6     4.6         3       30
## 7     0.0         0        8
## 8     4.6        12       42
## 9     4.4         6       41
## 10    4.2        12      115
## 11    4.8         6       25
## 12    4.6        19      151
## 13    4.5        10      385
## 14    4.8       166      754
## 15    3.6        34    3,396

Converting the columns above to numbers afterwards is pretty basic.
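A minimal sketch of that conversion in base R, assuming the result is a data frame of character columns. The data frame res below is a made-up two-row stand-in for the output above.

```r
# A small made-up data frame shaped like the result above (all columns character)
res <- data.frame(
  rating    = c("4.0", "2.4"),
  n_ratings = c("1", "14"),
  enrolled  = c("9", "2,106"),
  stringsAsFactors = FALSE
)

# Strip thousands separators, then convert every column to numeric
res[] <- lapply(res, function(col) as.numeric(gsub(",", "", col, fixed = TRUE)))
str(res)
```

Assigning into res[] keeps the data frame structure and column names while replacing the contents of each column.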