Summarize from string matches
I have this df column:
df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld", "asdthreeasp", "asdfetwoasd", "fouroqwke","okasdtwo", "acmofour", "porefour", "okstwo"))
> df
Strings
1 ñlas onepojasd
2 onenañdsl
3 ñelrtwofkld
4 asdthreeasp
5 asdfetwoasd
6 fouroqwke
7 okasdtwo
8 acmofour
9 porefour
10 okstwo
I know that every value in df$Strings will match one of the words one, two, three or four. I also know that each value will match only one of those words. So, to match them:
str_detect(df$Strings,"one")
str_detect(df$Strings,"two")
str_detect(df$Strings,"three")
str_detect(df$Strings,"four")
However, I'm stuck here, because I'm trying to build this table:
Homes  Quantity  Percent
One           2      0.2
Two           4      0.4
Three         1      0.1
Four          3      0.3
Total        10        1
You can use str_extract, then apply table and prop.table, i.e.:
library(stringr)
str_extract(df$Strings, 'one|two|three|four')
#[1] "one"   "one"   "two"   "three" "two"   "four"  "two"   "four"  "four"  "two"
table(str_extract(df$Strings, 'one|two|three|four'))
# four   one three   two
#    3     2     1     4
prop.table(table(str_extract(df$Strings, 'one|two|three|four')))
# four   one three   two
#  0.3   0.2   0.1   0.4
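The counts above rely on the asker's assumption that every string contains exactly one of the four words. str_extract only returns the first match and gives NA when nothing matches, and table drops NA by default, so a violation would go unnoticed. A minimal sketch of a sanity check, assuming the same df as in the question (the names pattern and n_matches are only illustrative):

```r
library(stringr)

df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"))

pattern <- "one|two|three|four"

# Number of non-overlapping matches per string; expected to be 1 everywhere
n_matches <- str_count(df$Strings, pattern)

# Rows that violate the "exactly one match" assumption (empty if all is well)
df[n_matches != 1, , drop = FALSE]

# Keep unmatched strings visible as an NA column instead of dropping them
table(str_extract(df$Strings, pattern), useNA = "ifany")
```

With the sample data every string matches exactly once, so the check returns no rows and the table is unchanged.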
With tidyverse and janitor, you can do:
library(tidyverse)
library(janitor)
df %>%
  mutate(Homes = str_extract(Strings, "one|two|three|four"),  # matched word
         n = n()) %>%                                         # total number of rows
  group_by(Homes) %>%
  summarise(Quantity = length(Homes),
            Percent = first(length(Homes)/n)) %>%
  adorn_totals("row")
#  Homes Quantity Percent
#   four        3     0.3
#    one        2     0.2
#  three        1     0.1
#    two        4     0.4
#  Total       10     1.0
Or with tidyverse only:
library(tidyverse)
df %>%
  mutate(Homes = str_extract(Strings, "one|two|three|four"),  # matched word
         n = n()) %>%                                         # total number of rows
  group_by(Homes) %>%
  summarise(Quantity = length(Homes),
            Percent = first(length(Homes)/n)) %>%
  rbind(., data.frame(Homes = "Total", Quantity = sum(.$Quantity),
                      Percent = sum(.$Percent)))
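In both cases, the code first extracts the matched word and counts the total number of rows. Second, it groups by the matched word. Third, it computes the number of rows per word and each word's share of all rows. Finally, it adds a "Total" row.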
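The same four steps can also be sketched with dplyr's count(), which does the grouping and counting in one call and avoids carrying the helper column n. This is only an alternative sketch, not the answerer's code; the names res and out are illustrative:

```r
library(dplyr)
library(stringr)

df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"))

res <- df %>%
  mutate(Homes = str_extract(Strings, "one|two|three|four")) %>%  # extract the word
  count(Homes, name = "Quantity") %>%                             # group and count
  mutate(Percent = Quantity / sum(Quantity))                      # share of all rows

# Append a one-row summary as the "Total" row
out <- bind_rows(res,
                 summarise(res, Homes = "Total",
                           Quantity = sum(Quantity),
                           Percent = sum(Percent)))
out
```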
A base R option would be regmatches/regexpr with table:
table(regmatches(df$Strings, regexpr('one|two|three|four', df$Strings)))
# four   one three   two
#    3     2     1     4
Add addmargins to get the sum, then divide by it:
out <- addmargins(table(regmatches(df$Strings,
         regexpr('one|two|three|four', df$Strings))))
out/out[length(out)]
# four   one three   two   Sum
#  0.3   0.2   0.1   0.4   1.0
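To get from these pieces to the layout asked for in the question (Homes, Quantity, Percent, rows in the order one to four, plus a Total row), the counts and proportions can be combined into one data frame. A minimal base R sketch, assuming the same df; the names words, hits, counts and res are only illustrative:

```r
df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"))

words <- c("one", "two", "three", "four")

# Extract the matched word from each string.
# Note: regmatches() with regexpr() silently drops strings that do not match.
hits <- regmatches(df$Strings, regexpr(paste(words, collapse = "|"), df$Strings))

# A factor with explicit levels fixes the row order to one, two, three, four
counts <- table(factor(hits, levels = words))

res <- data.frame(
  Homes    = paste0(toupper(substr(words, 1, 1)), substring(words, 2)),
  Quantity = as.vector(counts),
  Percent  = as.vector(prop.table(counts))
)

# Append the Total row
res <- rbind(res, data.frame(Homes    = "Total",
                             Quantity = sum(res$Quantity),
                             Percent  = sum(res$Percent)))
res
```

Using a factor with fixed levels also keeps a word in the table with a count of 0 if it never occurs, which the plain table() call would omit.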