Extracting numbers with decimals from large strings in R
I want to extract the numbers from this vector of 15 observations:
rs <- c("\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.0\n (1 rating)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 9 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.7\n (4 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 34 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 3.1\n (5 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 22 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 2.4\n (14 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 2,106 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.3\n (67 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 1,287 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.6\n (3 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 30 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n \n\n \n New\n \n\n\n \n\n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 0.0\n (0 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 8 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n \n\n \n Highest Rated\n \n\n\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.6\n (12 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 42 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.4\n (6 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 41 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.2\n (12 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 115 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.8\n (6 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 25 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.6\n (19 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 151 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.5\n (10 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 385 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.8\n (166 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 754 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 3.6\n (34 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 3,396 students enrolled\n \n \n\n\n \n\n "
)
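As you can see, the 15 elements are long and messy. However, the pattern inside each of them is easy to recognize. Each element contains 3 numbers (taking the first observation as an example):
- Rating: from 0 to 5. For example, 4.0
- Number of ratings. For example, (1 rating)
- Students enrolled. For example, 9 students enrolled
I want to extract all of these values and create a data frame with 3 columns, one for each variable.
I have been going through several questions on Stack Overflow, mostly about the stringr package and the use of gsub(). However, I could not find the key to solving my problem.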
Update
This is the code I have tried:
as.numeric(str_extract(rs, "[0-9]+"))
as.numeric(str_extract(rs, "[0-9]+")[[1]])
as.numeric(str_extract(rs, "(?<=\\()[0-9]+(?=\\))"))
as.numeric(sapply(strsplit(rs, " "), "[[", 1))
With extract from tidyr, we can do:
library(dplyr)
library(tidyr)

data.frame(rs, stringsAsFactors = FALSE) %>%
  # one capture group per output column; backslashes are doubled inside R strings
  extract(rs, c("Rating", "Number_of_ratings", "Students_enrolled"),
          "(?s)(\\d\\.\\d).*?(\\d+)\\s*ratings?.*?(\\d+(?:,\\d+)?)\\s*students enrolled",
          convert = TRUE) %>%
  # drop the thousands separator before converting to numeric
  mutate(Students_enrolled = as.numeric(sub(",", "", Students_enrolled)))
Output:
   Rating Number_of_ratings Students_enrolled
1     4.0                 1                 9
2     4.7                 4                34
3     3.1                 5                22
4     2.4                14              2106
5     4.3                67              1287
6     4.6                 3                30
7     0.0                 0                 8
8     4.6                12                42
9     4.4                 6                41
10    4.2                12               115
11    4.8                 6                25
12    4.6                19               151
13    4.5                10               385
14    4.8               166               754
15    3.6                34              3396
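Notes:
- The regex looks complicated, but it really isn't. What extract does is pull the match from each capture group (the parts wrapped in parentheses) and turn each one into its own column.
- (?s) is the modifier that turns on "DOTALL" mode. This lets the dot . match newline characters as well.
- (\d\.\d) matches the Rating pattern.
- (\d+)\s*ratings? matches the Number_of_ratings pattern but only extracts the digits (\d+).
- (\d+(?:,\d+)?)\s*students enrolled matches the Students_enrolled pattern but only extracts the "digits with or without a comma" part.
- convert = TRUE tries to convert the resulting columns to their best data type, but because Students_enrolled contains commas, an extra mutate is needed to convert it to numeric.
- Normally, extract throws an error if the number of capture groups does not equal the number of output columns, but since the modifier (?s) and the non-capturing group (?:...) do not count as capture groups, the number of capture groups matches the number of columns.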
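To see the capture groups at work without tidyr, here is a minimal base R sketch of the same idea. It assumes PCRE matching via perl = TRUE, which is not necessarily what extract uses internally, and the two shortened strings in x are made-up stand-ins for elements of rs.

```r
# Two shortened sample strings standing in for elements of `rs` (illustrative only)
x <- c(
  "\n  4.0\n (1 rating)\n  Course Ratings are calculated ...\n  9 students enrolled\n",
  "\n  2.4\n (14 ratings)\n  Course Ratings are calculated ...\n  2,106 students enrolled\n"
)

# Same pattern as in the answer; backslashes are doubled inside an R string
pattern <- "(?s)(\\d\\.\\d).*?(\\d+)\\s*ratings?.*?(\\d+(?:,\\d+)?)\\s*students enrolled"

# regexec() reports the full match plus one entry per capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Drop the full match (first element) and keep the three captured groups
groups <- do.call(rbind, lapply(m, function(g) g[-1]))

res <- data.frame(
  Rating            = as.numeric(groups[, 1]),
  Number_of_ratings = as.integer(groups[, 2]),
  Students_enrolled = as.numeric(gsub(",", "", groups[, 3])),
  stringsAsFactors  = FALSE
)
res
```

Each row of groups holds exactly three strings because only the three parenthesised groups capture; (?s) and (?:...) contribute nothing to that count.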
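So your problem is that it doesn't see the "." as part of the number, because inside a string it is just another character. You therefore need to match the digits and the decimal point explicitly.
library(stringr)
Rating <- as.numeric(str_extract(rs, "[0-9]\\.[0-9]"))
# "[0-9]+" so that multi-digit counts such as "(14 ratings)" are captured in full
NRatings <- str_extract(rs, "\\([0-9]+") %>% str_replace("\\(", "") %>% as.numeric()
I'll let you work out the last one from these examples ;)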
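For the remaining field, one possible approach (a sketch only, not necessarily what the answerer had in mind) is to match the digits and commas that sit directly in front of "students enrolled". The version below uses base R with a PCRE lookahead instead of stringr, and x is a made-up pair of shortened strings standing in for rs.

```r
# Shortened sample strings standing in for elements of `rs` (illustrative only)
x <- c(
  "\n  4.0\n (1 rating)\n  9 students enrolled\n",
  "\n  2.4\n (14 ratings)\n  2,106 students enrolled\n"
)

# Digits and commas that are immediately followed by whitespace + "students enrolled"
m <- regexpr("[0-9,]+(?=\\s+students enrolled)", x, perl = TRUE)
raw <- regmatches(x, m)  # "9" "2,106"

# Remove the thousands separator before converting
NStudents <- as.numeric(gsub(",", "", raw, fixed = TRUE))
NStudents
```

Note that regmatches() silently drops elements with no match, so on real data it is worth checking that length(NStudents) equals length(rs) before binding the columns together.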
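A base R solution with a single dependency (stringi), using a commented, readable regular expression.
This also shows how to clean up the text for processing (in a way you can reuse).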
library(stringi)

do.call(
  rbind.data.frame,
  lapply(
    stri_match_all_regex(
      stri_replace_all_regex(
        stri_trim_both(rs), # clean up outer spaces
        "[[:blank:][:space:]]+", " " # clean up inner spaces
      ),
      "
([[:digit:]\\.]+)[[:space:]]+\\(([[:digit:],]+)[[:space:]]+rating[s]*\\) # pick up the rating and total number of ratings
[^[:digit:]]*([[:digit:],]+)[[:space:]]+student[s]*[[:space:]]+enrolled # pick up the number of students enrolled
",
      opts_regex = stri_opts_regex(comments = TRUE)
    ),
    function(x) {
      as.list(
        setNames(
          x[2:4], c("rating", "n_ratings", "enrolled")
        ),
        stringsAsFactors = FALSE
      )
    }
  )
)
Which results in:
##    rating n_ratings enrolled
## 2     4.0         1        9
## 21    4.7         4       34
## 3     3.1         5       22
## 4     2.4        14    2,106
## 5     4.3        67    1,287
## 6     4.6         3       30
## 7     0.0         0        8
## 8     4.6        12       42
## 9     4.4         6       41
## 10    4.2        12      115
## 11    4.8         6       25
## 12    4.6        19      151
## 13    4.5        10      385
## 14    4.8       166      754
## 15    3.6        34    3,396
Converting ^^ to numbers afterwards is pretty basic.
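As a rough illustration of that last step, the sketch below strips the thousands separators and converts every column to numeric. The data frame df is a hypothetical stand-in that copies three rows from the result shown above; it is not produced by the stringi code here.

```r
# A small data frame shaped like the result above (three rows copied from it)
df <- data.frame(
  rating    = c("4.0", "2.4", "3.6"),
  n_ratings = c("1", "14", "34"),
  enrolled  = c("9", "2,106", "3,396"),
  stringsAsFactors = FALSE
)

# Strip thousands separators, then convert every column to numeric;
# assigning into df[] keeps the data frame structure and column names
df[] <- lapply(df, function(col) as.numeric(gsub(",", "", col, fixed = TRUE)))
str(df)
```

gsub() coerces its input to character first, so the same line should also work if the columns came out as factors.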