Replacing numbers in a string with rounded numbers using regex
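I have a data frame containing comments from different people (so they can be written in whatever form the author likes). A sample of the data frame is shown below (this is only a sample; my original dataset has more than 50,000 rows):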
structure(list(comment = c("3.22%-1ST 0K/1.15% BAL", "3.25% ON 1ST 0,000/1.1625% ON BAL",
"3.225% 1ST 100K/1.1625 ON BAL", "3.22% 1ST 100K/1.15% ON BAL",
"3.255% 1ST 100K/1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL",
"3.22% ON 1ST 100K & 1.15% ON BALANCE", "3.255% 1ST 100K/1.1625% ON BAL",
"3.22% ON 1ST 100K / 1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL",
"3.2% 1ST 100K/1.15% ON BAL", "3.2% 1ST 0K + 1.1625% BALANCE",
"3.255% ON 1ST 0K & 1.1625% ON BALANCE", "3.225% ON 1ST 0,000 AND 1.16% ON BALANCE",
"3.255% ON FIRST 0,000 AND 1.1625% ON BALANCE", "00", ",500",
",000", ",000", "00.00", ",000 PLUS BONUS ,000",
"4-100/1.1625", "3.2% 1ST 100K/1.15% ON BAL", "3.2% ON 1ST 0,000 + 1.15% ON BAL",
"THE GREATER ,000 OR .5% OF SALE PRICE", "**3.255% ON THE 1ST 0,000 AND 1.1625% ON THE BALANCE"
), a = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)), row.names = c(NA,
-26L), class = c("tbl_df", "tbl", "data.frame"))
1 3.22%-1ST 0K/1.15% BAL TRUE
2 3.25% ON 1ST 0,000/1.1625% ON BAL TRUE
3 3.225% 1ST 100K/1.1625 ON BAL TRUE
4 3.22% 1ST 100K/1.15% ON BAL TRUE
5 3.255% 1ST 100K/1.1625% ON BAL TRUE
6 3.2% 1ST 100K/1.15% ON BAL TRUE
7 3.22% ON 1ST 100K & 1.15% ON BALANCE TRUE
8 3.255% 1ST 100K/1.1625% ON BAL TRUE
9 3.22% ON 1ST 100K / 1.1625% ON BAL TRUE
10 3.2% 1ST 100K/1.15% ON BAL TRUE
11 ............................ ....
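As you can see, this data frame has no specific format, which makes it hard to work with.
What I want to do: I want to change all the numbers such as 3.255%, 3.2%, 3.22%, 4, etc. (basically, every number in the range 0 to 5 in each comment) to the same format, x.yz%.
What is the challenge? The main challenge is that some rows do not start with a number, such as $4000 or "THE GREATER $3,000 OR .5% OF SALE PRICE", and are clearly in a different format. One approach is to separate the rows that start with a digit and temporarily flag them with TRUE or FALSE. I wrote the command below (but I am not sure whether this is a good idea!):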
df_com$a <- str_detect(df_com$comment, pattern = "^\\d")
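However, with this approach we may miss rows such as "THE GREATER $3,000 OR .5% OF SALE PRICE". That row should be changed to "THE GREATER $3,000 OR .50% OF SALE PRICE".
In addition, to replace the numbers and round them, I followed an existing answer and modified it as shown below: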
gsubfn("(\\d\\w{1:3})", ~format(round(as.numeric(x), 2), nsmall=2), x)
However, this expression does not work.
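A likely reason the call fails, offered as a guess rather than a confirmed diagnosis: a bounded quantifier in a regex is written {1,3}, not {1:3}, and \w does not match the decimal point, so a number such as 3.255 is never captured as a whole. The base R sketch below shows a pattern that does capture decimal numbers. The sample vector x and the pattern itself are illustrative assumptions, not part of the original data frame.

```r
# Illustrative input; not taken from the original data frame.
x <- c("3.2% 1ST 100K/1.15% ON BAL", "NO NUMBERS HERE")

# One or more digits, a literal dot, then one or more digits.
# This matches "3.2" and "1.15" but skips "100" (no decimal point).
pattern <- "[0-9]+\\.[0-9]+"

m <- gregexpr(pattern, x)
found <- regmatches(x, m)
print(found)
```

Note that inside an R string literal each regex backslash has to be doubled, so \. is typed as "\\.".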
The expected result looks like this:
1 3.22%-1ST 0K/1.15% BAL TRUE
2 3.25% ON 1ST 0,000/1.16% ON BAL TRUE
3 3.23% 1ST 100K/1.16 ON BAL TRUE
4 3.22% 1ST 100K/1.15% ON BAL TRUE
5 3.26% 1ST 100K/1.16% ON BAL TRUE
6 3.20% 1ST 100K/1.15% ON BAL TRUE
7 3.22% ON 1ST 100K & 1.15% ON BALANCE TRUE
8 3.26% 1ST 100K/1.16% ON BAL TRUE
9 3.22% ON 1ST 100K / 1.16% ON BAL TRUE
10 3.20% 1ST 100K/1.15% ON BAL TRUE
11 ............................ ....
Any suggestions on how to accomplish this task?
In base R, this is a good place to use gregexpr and regmatches:
gre <- gregexpr("\\b?[0-9]*\\.[0-9]*(?=%)", df_com$comment, perl = TRUE)
str(regmatches(df_com$comment, gre))
# List of 26
# $ : chr [1:2] "3.22" "1.15"
# $ : chr [1:2] "3.25" "1.16"
# $ : chr "3.23"
# $ : chr [1:2] "3.22" "1.15"
# $ : chr [1:2] "3.25" "1.16"
# $ : chr [1:2] "3.20" "1.15"
# $ : chr [1:2] "3.22" "1.15"
# $ : chr [1:2] "3.25" "1.16"
# $ : chr [1:2] "3.22" "1.16"
# $ : chr [1:2] "3.20" "1.15"
# $ : chr [1:2] "3.20" "1.15"
# $ : chr [1:2] "3.20" "1.16"
# $ : chr [1:2] "3.25" "1.16"
# $ : chr [1:2] "3.23" "1.16"
# $ : chr [1:2] "3.25" "1.16"
# $ : chr(0)
# $ : chr(0)
# $ : chr(0)
# $ : chr(0)
# $ : chr(0)
# $ : chr(0)
# $ : chr(0)
# $ : chr [1:2] "3.20" "1.15"
# $ : chr [1:2] "3.20" "1.15"
# $ : chr(0)
# $ : chr [1:2] "3.25" "1.16"
regmatches(df_com$comment, gre) <-
lapply(regmatches(df_com$comment, gre), function(nums) {
format(round(as.numeric(nums), 2), nsmall=2)
})
Result:
df_com
# # A tibble: 26 x 2
# comment a
# <chr> <lgl>
# 1 3.22%-1ST 0K/1.15% BAL TRUE
# 2 3.25% ON 1ST 0,000/1.16% ON BAL TRUE
# 3 3.23% 1ST 100K/1.1625 ON BAL TRUE
# 4 3.22% 1ST 100K/1.15% ON BAL TRUE
# 5 3.25% 1ST 100K/1.16% ON BAL TRUE
# 6 3.20% 1ST 100K/1.15% ON BAL TRUE
# 7 3.22% ON 1ST 100K & 1.15% ON BALANCE TRUE
# 8 3.25% 1ST 100K/1.16% ON BAL TRUE
# 9 3.22% ON 1ST 100K / 1.16% ON BAL TRUE
# 10 3.20% 1ST 100K/1.15% ON BAL TRUE
# # ... with 16 more rows
I took x.yz% literally, which explains why the 1.1625 in row 3 is unchanged.
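If the 1.1625 without a percent sign should be rounded as well, one option is to drop the % lookahead and instead skip numbers that are part of a longer amount. The following is a minimal sketch under assumptions: the function name round_rates, the lookbehind pattern, and the choice of sprintf for formatting are hypothetical and are not part of the answer above.

```r
# Hypothetical helper: rounds small decimal numbers to two places,
# whether or not they are followed by "%".
round_rates <- function(x) {
  # At most one leading digit, a dot, then digits. The lookbehind rejects
  # matches that start inside a longer number such as "100.00" or "3,000.5".
  pattern <- "(?<![0-9,.])[0-9]?\\.[0-9]+"
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(nums) {
    # sprintf always prints two decimals and adds the leading zero for ".5"
    sprintf("%.2f", as.numeric(nums))
  })
  x
}

# Illustrative input only.
sample_comments <- c("3.2% 1ST 100K/1.1625 ON BAL", "OR .5% OF SALE PRICE")
print(round_rates(sample_comments))
```

Using sprintf rather than format avoids the padding that format applies when several matches of different widths are formatted together.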
We can use str_replace_all from the {stringr} package with a function as the replacement argument. See my updated answer below and check whether it fits your needs.
library(dplyr)
library(stringr)
dat <- structure(list(comment = c("3.22%-1ST 0K/1.15% BAL", "3.25% ON 1ST 0,000/1.1625% ON BAL",
"3.225% 1ST 100K/1.1625 ON BAL", "3.22% 1ST 100K/1.15% ON BAL",
"3.255% 1ST 100K/1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL",
"3.22% ON 1ST 100K & 1.15% ON BALANCE", "3.255% 1ST 100K/1.1625% ON BAL",
"3.22% ON 1ST 100K / 1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL",
"3.2% 1ST 100K/1.15% ON BAL", "3.2% 1ST 0K + 1.1625% BALANCE",
"3.255% ON 1ST 0K & 1.1625% ON BALANCE", "3.225% ON 1ST 0,000 AND 1.16% ON BALANCE",
"3.255% ON FIRST 0,000 AND 1.1625% ON BALANCE", "00", ",500",
",000", ",000", "00.00", ",000 PLUS BONUS ,000",
"4-100/1.1625", "3.2% 1ST 100K/1.15% ON BAL", "3.2% ON 1ST 0,000 + 1.15% ON BAL",
"THE GREATER ,000 OR .5% OF SALE PRICE", "**3.255% ON THE 1ST 0,000 AND 1.1625% ON THE BALANCE"
), a = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)), row.names = c(NA,
-26L), class = c("tbl_df", "tbl", "data.frame"))
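dat2 <- dat %>%
  mutate(comment2 =
           # numbers with a leading digit, e.g. "3.2" -> "3.20"
           str_replace_all(comment,
                           "[0-9]\\.[0-9]*",
                           function(x) format(round(as.numeric(x), 2), nsmall = 2)) %>%
           # numbers without a leading digit, e.g. " .5" -> " 0.50"
           str_replace_all("\\s\\.[0-9]*",
                           function(x) paste0(" ", format(round(as.numeric(x), 2), nsmall = 2))) %>%
           # a bare digit before a hyphen, e.g. "4-" -> "4.00%-"
           str_replace_all("([0-9])-",
                           "\\1.00%-")
  )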
tail(dat2)
#> # A tibble: 6 x 3
#> comment a comment2
#> <chr> <lgl> <chr>
#> 1 ,000 PLUS BONUS ,000 FALSE ,000 PLUS BONUS ,000
#> 2 4-100/1.1625 TRUE 4.00%-100/1.16
#> 3 3.2% 1ST 100K/1.15% ON BAL TRUE 3.20% 1ST 100K/1.15% ON BAL
#> 4 3.2% ON 1ST 0,000 + 1.15% ON BAL TRUE 3.20% ON 1ST 0,000 + 1.15% ON ~
#> 5 THE GREATER ,000 OR .5% OF SALE PR~ FALSE THE GREATER ,000 OR 0.50% OF SA~
#> 6 **3.255% ON THE 1ST 0,000 AND 1.1~ FALSE **3.26% ON THE 1ST 0,000 AND 1~
Created on 2020-11-05 by the reprex package (v0.3.0)
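The two outputs above disagree on 3.255: the base R result shows 3.25, while the stringr result shows 3.26. A plausible explanation, which the source does not confirm, is that 3.255 has no exact binary representation, so the result of round() on such near-ties can depend on the R version in use. The sketch below only inspects the stored value and contrasts it with a number that is not a tie; the variable names are illustrative.

```r
# 3.255 cannot be stored exactly as a double; printing extra digits
# shows the value R actually works with.
stored <- sprintf("%.20f", 3.255)
print(stored)

# 1.1625 is not halfway between 1.16 and 1.17, so it rounds to 1.16
# regardless of how ties are handled.
rounded_plain <- format(round(1.1625, 2), nsmall = 2)
print(rounded_plain)
```

If the expected 3.26 matters, it may be worth checking the rounding of a few known values on the R installation that will process the full data set.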