R: Replacing Strings with their Most Common Variant
I would like to standardize a set of manually entered strings, so that:
index fruit
1 Apple Pie
2 Apple Pie.
3 Apple. Pie
4 Apple Pie
5 Pear
should look like:
index fruit
1 Apple Pie
2 Apple Pie
3 Apple Pie
4 Apple Pie
5 Pear
For my use case, grouping them by phonetic sound is fine, but I am missing the piece that replaces the less common strings with the most common one.
library(tidyverse)
library(stringdist)
index <- seq(1,5,1)
fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")
df <- data.frame(index, fruit) %>%
  mutate(grouping = phonetic(fruit)) %>%
  add_count(fruit) %>%
  # Missing Code
  select(index, fruit)
We can use str_remove to remove the period:
library(dplyr)
library(stringr)
data.frame(index, fruit) %>%
  # the backslash must be doubled inside an R string literal
  mutate(fruit = str_remove(fruit, "\\."))
# index fruit
#1 1 Apple Pie
#2 2 Apple Pie
#3 3 Apple Pie
#4 4 Apple Pie
#5 5 Pear
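One caveat: str_remove deletes only the first match in each string, which is enough for the sample data because no entry has more than one period. A minimal sketch of the difference, using a hypothetical input vector `x` that is not part of the question:

```r
library(stringr)

# Hypothetical input: the first entry contains two periods
x <- c("Apple. Pie.", "Apple Pie", "Pear")

# str_remove() drops only the first match in each string
first_only <- str_remove(x, "\\.")

# str_remove_all() drops every match; fixed(".") treats the pattern
# as a literal period, so no regex escaping is needed
all_removed <- str_remove_all(x, fixed("."))

print(first_only)
print(all_removed)
```

If entries may contain several stray periods, str_remove_all is probably the safer choice.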
If we need to use phonetic and find the most frequent value:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
data.frame(index, fruit) %>%
  mutate(grouping = phonetic(fruit)) %>%
  group_by(grouping) %>%
  mutate(fruit = Mode(fruit))
# A tibble: 5 x 3
# Groups: grouping [2]
# index fruit grouping
# <dbl> <fct> <chr>
#1 1 Apple Pie A141
#2 2 Apple Pie A141
#3 3 Apple Pie A141
#4 4 Apple Pie A141
#5 5 Pear P600
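To make the Mode helper less opaque, here is a sketch that walks through its intermediate steps on a single group. The vector `x` and the names `pos`, `counts`, `most_common` and `tie_winner` are illustrative only; Mode is repeated so the snippet runs on its own.

```r
# Mode() as defined in the answer, repeated so this runs on its own
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical group of spellings that share one phonetic code
x <- c("Apple Pie.", "Apple Pie", "Apple. Pie", "Apple Pie")

ux <- unique(x)          # distinct spellings, in order of first appearance
pos <- match(x, ux)      # position of each element within ux
counts <- tabulate(pos)  # how often each distinct spelling occurs

most_common <- Mode(x)

# On a tie, which.max() returns the first maximum,
# so the spelling that appears first in the data wins
tie_winner <- Mode(c("Pear.", "Pear"))

print(counts)
print(most_common)
print(tie_winner)
```

The tie case matters for real data: a group in which every variant occurs once is standardized to whichever variant happens to come first.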
It sounds like you need group_by to form the groups, and then pick the most frequent item (the mode) in each one:
df %>%
  mutate(grouping = phonetic(fruit)) %>%
  group_by(grouping) %>%
  mutate(fruit = names(which.max(table(fruit))))
# A tibble: 5 x 3
# Groups: grouping [2]
index fruit grouping
<dbl> <fctr> <chr>
1 1 Apple Pie A141
2 2 Apple Pie A141
3 3 Apple Pie A141
4 4 Apple Pie A141
5 5 Pear P600
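A small sketch of how names(which.max(table(...))) behaves on edge cases, using hypothetical vectors that are not from the question. Because table() orders its names by sorting them, ties are resolved by sort order rather than by order of appearance.

```r
# Hypothetical tied group: each spelling occurs once
x <- c("Pear.", "Pear")

# table() sorts its names, so on a tie which.max() picks
# the first name in sorted order ("Pear" sorts before "Pear.")
tab <- table(x)
winner <- names(which.max(tab))

# names() returns a character vector even when the input is a factor
f <- factor(c("Apple Pie.", "Apple Pie", "Apple Pie"))
winner_from_factor <- names(which.max(table(f)))

print(winner)
print(winner_from_factor)
```

Whether sorted-order tie-breaking is acceptable depends on the data; it is at least deterministic regardless of row order.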
Another way could be:
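fruit %>%
  enframe() %>%
  # enframe() stores the strings in the `value` column
  mutate(grouping = phonetic(value)) %>%
  add_count(value, grouping) %>%
  group_by(grouping) %>%
  # take the first value whose count equals the group maximum
  mutate(value = value[match(max(n), n)]) %>%
  select(-n) %>%
  ungroup()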
name value grouping
<int> <chr> <chr>
1 1 Apple Pie A141
2 2 Apple Pie A141
3 3 Apple Pie A141
4 4 Apple Pie A141
5 5 Pear P600