How to change all values except the top values (by frequency) of a categorical variable in R

Question

I have a data frame in R that looks similar to the one below, with a factor variable "Genre":

|Genre|Listening Time|
|Rock |1:05          |
|Pop  |3:10          |
|RnB  |4:12          |
|Rock |2:34          |
|Pop  |5:01          |
|RnB  |4:01          |
|Rock |1:34          |
|Pop  |2:04          |

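I want to keep the top 15 genres (by count) unchanged and rename all other genres that are not in the top 15. These should be renamed to the word "Other".

In other words: if, for example, the genre "RnB" is not among the top 15 genres, it should be replaced with "Other".

The table I would like to end up with looks like this: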

|Genre|Listening Time|
|Rock |1:05          |
|Pop  |3:10          |
|Other|4:12          |
|Rock |2:34          |
|Pop  |5:01          |
|Other|4:01          |
|Rock |1:34          |
|Pop  |2:04          |

How would I go about this? Thanks!

Answer 1

Try replacing df with your own data.frame to check whether you get the desired output:

df <- data.frame(Genre=sample(letters, 1000, replace=TRUE),
                 ListeningTime=runif(1000, 3, 5))

> head(df)
  Genre ListeningTime
1     j      3.437013
2     n      4.151121
3     p      3.109044
4     z      4.529619
5     h      4.043982
6     i      3.590463

freq <- table(df$Genre)
sorted <- sort(freq, decreasing=TRUE)  # Sorted by frequency of df$Genre

> sorted
 d  x  o  q  r  u  g  i  j  f  a  p  b  e  v  n  w  c  k  m  z  l  h  t  y  s 
53 50 46 45 45 42 41 41 40 39 38 38 37 37 37 36 36 35 35 35 35 34 33 33 30 29

not_top_15 <- names(sorted)[-(1:15)]  # The Genres not in the top 15
pos <- which(df$Genre %in% not_top_15)  # Their position in df

> head(df[pos, ])  # The original data, without the top 15 Genres
   Genre ListeningTime
2      n      4.151121
4      z      4.529619
5      h      4.043982
7      s      3.521054
16     w      3.528091
18     h      4.588815
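
The code above stops at locating the rows outside the top 15; it does not yet rename them. A minimal sketch of the remaining step follows, assuming a hypothetical toy data frame with 20 genres and no ties in the counts (the data and the conversion to character are my own assumptions, not part of the answer):

```r
# Hypothetical toy data: 20 genres with distinct counts (a = 20 rows, ..., t = 1 row)
df <- data.frame(Genre = rep(letters[1:20], times = 20:1),
                 ListeningTime = seq(3, 5, length.out = 210))

sorted <- sort(table(df$Genre), decreasing = TRUE)  # Genres by frequency
not_top_15 <- names(sorted)[-(1:15)]                # Genres outside the top 15

# Work on character values so that "Other" need not already be a factor level
df$Genre <- as.character(df$Genre)
df$Genre[df$Genre %in% not_top_15] <- "Other"

table(df$Genre)
```

If Genre has to stay a factor, wrapping the result in factor() afterwards should be enough.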

Answer 2

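If you want to look into the tidyverse, you could do it like this. I tried to mimic your data frame, but added more rows.

You start with the data > group_by Genre > order > select the top 5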


library(tidyverse)

set.seed(1)
Data <- data.frame(
  listen = format(as.POSIXlt(paste0(
      as.character(sample(1:5)),
      ':',
      as.character(sample(0:59))), format = '%H:%M'),format = '%H:%M'),
  Genre = sample(c("Rock", "Pop", 'RnB'), 120, replace = TRUE)
)


Data %>%
  group_by(Genre ) %>%
  arrange(desc(listen)) %>% 
  select(listen) %>% 
  top_n(5) %>% 
  arrange(Genre)
#> Adding missing grouping variables: `Genre`
#> Selecting by listen
#> # A tibble: 15 x 2
#> # Groups:   Genre [3]
#>    Genre listen
#>    <chr> <chr> 
#>  1 Pop   05:47 
#>  2 Pop   05:47 
#>  3 Pop   05:43 
#>  4 Pop   05:41 
#>  5 Pop   05:28 
#>  6 RnB   05:54 
#>  7 RnB   05:44 
#>  8 RnB   05:43 
#>  9 RnB   05:29 
#> 10 RnB   05:28 
#> 11 Rock  05:54 
#> 12 Rock  05:44 
#> 13 Rock  05:41 
#> 14 Rock  05:29 
#> 15 Rock  05:26

Sorry if I misunderstood you. If you assign the code to a new data.frame, anti_join it against the original DF, and then mutate Genre to others, it should be what you want, I guess.

df <- Data %>%
  group_by(Genre ) %>%
  arrange(desc(listen)) %>% 
  select(listen) %>% 
  top_n(5) %>% 
  arrange(Genre) 

# make an anti_join and assign 'other' to Genre

anti_join(Data, df) %>% 
  mutate(Genre = 'others')

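Next edit

I hope I have understood your question now. You simply want to count how often each genre occurs in the data and label the genres that are not in the top 15 as Others. Maybe I was misled by the data frame you provided, which shows only 3 genres. So I looked some up on Wikipedia, invented a few genres of my own, and used LETTERS to build a DF with a sufficient number of genres.

Count the occurrences of each genre with count(Genre), then arrange them in descending order. I then introduce a new column with the row number. You can drop it if you like, since it is only needed for the next step, which introduces another column named Top15 (I chose to create a new column rather than rename all the names in Genre). It gives every genre in row 16 or later the name Others and leaves the rest unchanged.

head(20) only prints the first 20 rows of this DF.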

library(tidyverse)

set.seed(1)
Data <- data.frame(
  listen = format(as.POSIXlt(paste0(
      as.character(sample(1:5)),
      ':',
      as.character(sample(0:59))), format = '%H:%M'),format = '%H:%M'),
  Genre = sample(c("Rock", "Pop", 'RnB', 'Opera',
                   'Birthday Songs', 'HipHop',
                   'Chinese Songs', 'Napoli Lovesongs',
                   'Benga', 'Bongo',
                   'Kawito', 'Noise',
                   'County Blues','Mambo',
                   'Reggae',
                   LETTERS[0:24]), 300, replace = TRUE)
)

Data %>%
  count(Genre) %>%
  arrange(desc(n)) %>%
  mutate(place = row_number()) %>%
  mutate(Top15 = ifelse(place > 15, 'Others', Genre)) %>%
  head(20)
#> # A tibble: 20 x 4
#>    Genre            n place Top15
#>    <chr>        <int> <int> <chr>
#>  1 N               15     1 N
#>  2 T               13     2 T
#>  3 V               13     3 V
#>  4 K               12     4 K
#>  5 Rock            11     5 Rock
#>  6 X               11     6 X
#>  7 E               10     7 E
#>  8 W               10     8 W
#>  9 Benga            9     9 Benga
#> 10 County Blues     9    10 County Blues
#> 11 G                9    11 G
#> 12 J                9    12 J
#> 13 M                9    13 M
#> 14 Reggae           9    14 Reggae
#> 15 B                8    15 B
#> 16 D                8    16 Others
#> 17 I                8    17 Others
#> 18 P                8    18 Others
#> 19 R                8    19 Others
#> 20 S                8    20 Others

I hope this is what you were looking for.
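
Note that the Top15 column in this answer ends up on the count table (one row per genre), not on the original rows. A minimal sketch of carrying the label back to every row might look like the following; the toy data and the top15 helper vector are my own assumptions for illustration:

```r
library(dplyr)

# Hypothetical toy data: 20 genres with distinct counts (a = 20 rows, ..., t = 1 row)
Data <- data.frame(Genre = rep(letters[1:20], times = 20:1),
                   stringsAsFactors = FALSE)

# Names of the 15 most frequent genres
top15 <- Data %>%
  count(Genre, sort = TRUE) %>%
  head(15) %>%
  pull(Genre)

# Label every original row: keep the genre if it is in the top 15, else "Others"
Data <- Data %>%
  mutate(Top15 = ifelse(Genre %in% top15, Genre, "Others"))

count(Data, Top15, sort = TRUE)
```

With tied counts around position 15, head(15) cuts arbitrarily among the ties, so you may want to decide explicitly how ties should be handled.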

Answer 3

library(dplyr)

set.seed(123)
compute_listen_time <- function(n.songs) {
  min <- sample(1:15, n.songs, replace = TRUE)
  sec <- sample(0:59, n.songs, replace = TRUE)
  sec <- ifelse(sec >= 10, sec, paste0("0", sec))  # zero-pad single-digit seconds only
  paste0(min, ":", sec)
}



df <- data.frame(
  Genre = sample(c("Rock", "Pop", "RnB", "Rock", "Pop"), 100, replace = TRUE),
  Listen_Time = compute_listen_time(100)
)


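# Attach each genre's row count, rank genres by count (ties share a rank),
# and keep the name only for genres ranked 15 or better
df <- add_count(df, Genre, name = "count") %>%
  mutate(
    rank = dense_rank(desc(count)),
    group = ifelse(rank <= 15, Genre, "other")
  )
df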

Answer 4

I can think of a data.table solution. Assuming your data.frame is called music, then:

library(data.table)
setDT(music)

other_genres <- music[, .N, by = genre][order(-N)][16:.N, genre]

music[genre %chin% other_genres, genre := "other"]

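The first line of actual code counts the occurrences per genre, sorts them in descending order, selects rows 16 through the last, and assigns the result to the variable other_genres. The second line checks for the genres in that list and updates their name to "other".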
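
One caveat with 16:.N: if the data has fewer than 16 genres, the sequence counts backwards (for example 16:3) and picks up rows that belong to the top 15. A minimal sketch of a variant that avoids this, assuming a hypothetical toy table and a character genre column (both are my assumptions, and tail() is swapped in for the 16:.N indexing):

```r
library(data.table)

# Hypothetical toy data: 20 genres with distinct counts (a = 20 rows, ..., t = 1 row)
music <- data.table(genre = rep(letters[1:20], times = 20:1))

# Count per genre, most frequent first
counts <- music[, .N, by = genre][order(-N)]

# Everything after the first 15 rows; empty if there are 15 genres or fewer
other_genres <- tail(counts$genre, -15)

# Update by reference, as in the answer above
music[genre %chin% other_genres, genre := "other"]

music[, .N, by = genre][order(-N)]
```

If genre is stored as a factor, converting it with as.character() first is probably the simplest route, since %chin% expects character vectors.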

Answer 5

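Here is a very concise solution that applies the forcats package to the diamonds dataset, keeping only the top 5 clarity values by name and lumping the rest together as "Other":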

library(dplyr)
library(forcats)
library(ggplot2)  # provides the diamonds dataset

diamonds %>%
  mutate(clarity2 = fct_lump(fct_infreq(clarity), n = 5))

Result:

# A tibble: 53,940 x 11
   carat cut       color clarity depth table price     x     y     z clarity2
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>   
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 SI2     
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 SI1     
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31 VS1     
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 VS2     
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75 SI2     
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 VVS2    
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 Other   
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 SI1     
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 VS2     
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39 VS1     
# … with 53,930 more rows
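
Applied to the question's Genre column, the same idea could look roughly like this. It is a sketch on hypothetical toy data, using fct_lump_n() (available in newer forcats releases) in place of fct_lump():

```r
library(forcats)

# Hypothetical toy data: 20 genres with distinct counts (a = 20 rows, ..., t = 1 row)
df <- data.frame(Genre = rep(letters[1:20], times = 20:1))

# Keep the 15 most frequent genres and lump everything else into "Other"
df$Genre <- fct_lump_n(df$Genre, n = 15, other_level = "Other")

table(df$Genre)
```

As far as I know, genres tied at the cutoff are all kept by default, so with tied counts you can end up with more than 15 named levels; the ties.method argument controls this.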
