Is it possible to calculate the size of each data frame column in R?

In R, you can get the size of an entire object with object.size():

> object.size(dplyr::starwars)
50632 bytes

If you inspect the data frame, you can see that the columns do not all hold the same kind of content:

> head(dplyr::starwars)
# A tibble: 6 x 13
  name   height  mass hair_color skin_color eye_color birth_year gender homeworld species films vehicles
  <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis> <list>  
1 Luke …    172   77. blond      fair       blue            19.0 male   Tatooine  Human   <chr… <chr [2…
2 C-3PO     167   75. NA         gold       yellow         112.  NA     Tatooine  Droid   <chr… <chr [0…
3 R2-D2      96   32. NA         white, bl… red             33.0 NA     Naboo     Droid   <chr… <chr [0…
4 Darth…    202  136. none       white      yellow          41.9 male   Tatooine  Human   <chr… <chr [0…
5 Leia …    150   49. brown      light      brown           19.0 female Alderaan  Human   <chr… <chr [1…
6 Owen …    178  120. brown, gr… light      blue            52.0 male   Tatooine  Human   <chr… <chr [0…
# ... with 1 more variable: starships <list>

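Clearly, height will take up less space than hair_color. Is there a way to check which columns are the largest? For example, with a large data frame you might want to see whether a few columns take up a disproportionate amount of space.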

Simply iterate over all the columns with lapply/sapply and call object.size on each one:

library(dplyr)

sapply(starwars, object.size)

# name     height       mass hair_color skin_color  eye_color birth_year     gender 
# 5576        392        736       1336       2400       1480        736        936 

# homeworld    species      films   vehicles  starships 
#      3216       2648      17920       5136       6496 
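
Raw byte counts can be hard to compare at a glance, so it may help to express each column as a share of the total. The following is a minimal sketch, not part of the original answer: the data frame `df` and the names `col_bytes` and `col_share` are made up for illustration. Note that object.size() gives an estimate, so the per-column sizes need not add up exactly to the size reported for the whole data frame.

```r
# Hypothetical example data; any data frame can be used instead.
df <- data.frame(
  id    = 1:1000,
  value = rnorm(1000),
  label = as.character(seq_len(1000)),
  stringsAsFactors = FALSE
)

# Size of each column in bytes, as a plain named numeric vector.
col_bytes <- vapply(df, function(col) as.numeric(object.size(col)), numeric(1))

# Share of the summed column sizes in percent, largest first.
col_share <- sort(round(100 * col_bytes / sum(col_bytes), 1), decreasing = TRUE)
col_share
```

Using vapply instead of sapply is a design choice here: it fails loudly if a column does not yield exactly one number.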

If you are only interested in the few largest columns, you can do:

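# add_rownames() is deprecated in current dplyr in favor of
# tibble::rownames_to_column(); top_n(5) ranks by the last column (".")
sapply(starwars, object.size) %>%
  data.frame() %>%
  add_rownames() %>%
  top_n(5)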


#  rowname       .
#  <chr>     <dbl>
#1 name       5576
#2 homeworld  3216
#3 films     17920
#4 vehicles   5136
#5 starships  6496

Alternatively, in base R, sort the sizes in ascending order and keep the last five:

tail(sort(sapply(starwars, object.size)), 5)

#homeworld  vehicles      name starships     films 
#     3216      5136      5576      6496     17920
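
The two approaches above can be wrapped into a small reusable helper. This is a sketch in base R under stated assumptions: the function name `column_sizes` and its argument `n` are hypothetical, and the built-in `iris` data set stands in for your own data frame. It returns the largest columns first, with both the raw byte count and a human-readable size from format() with units = "auto".

```r
# Hypothetical helper: report the n largest columns of a data frame.
column_sizes <- function(df, n = 5) {
  # object.size() per column; keep the objects so they can be formatted later.
  sizes <- lapply(df, object.size)
  bytes <- vapply(sizes, as.numeric, numeric(1))

  out <- data.frame(
    column = names(df),
    bytes  = bytes,
    # Human-readable size; the unit is chosen automatically.
    size   = vapply(sizes, format, character(1), units = "auto"),
    row.names = NULL,
    stringsAsFactors = FALSE
  )

  # Largest columns first, at most n rows.
  head(out[order(out$bytes, decreasing = TRUE), ], n)
}

column_sizes(iris, n = 3)
```

For the data in the question, `column_sizes(dplyr::starwars)` would be the equivalent call, assuming dplyr is installed.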