R: Automatically Producing Histograms

I am using the R programming language. I created the following dataset for this example:

var_1 <- rnorm(1000,10,10)
var_2 <- rnorm(1000, 5, 5)
var_3 <- rnorm(1000, 6,18)

favorite_food <- c("pizza","ice cream", "sushi", "carrots", "onions", "broccoli", "spinach", "artichoke", "lima beans", "asparagus", "eggplant", "lettuce", "cucumbers")
favorite_food <-  sample(favorite_food, 1000, replace=TRUE, prob=c(0.5, 0.45, 0.04, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001))


response <- c("a","b")
response <- sample(response, 1000, replace=TRUE, prob=c(0.3, 0.7))


data = data.frame( var_1, var_2, var_3, favorite_food, response)

data$favorite_food = as.factor(data$favorite_food)
data$response = as.factor(data$response)

From here, I want to make histograms for the two categorical variables in this dataset and put them on the same page:

# Make histograms and put them on the same page
# (note: I don't know why the "par(mfrow = c(1,2))" statement is not working)
library(lattice)  # assumption: histogram() here is lattice::histogram
par(mfrow = c(1,2))

histogram(data$response, main = "response")

histogram(data$favorite_food, main = "favorite food")

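My question: is it possible to automatically produce histograms for all categorical variables in a given dataset (without manually writing a "histogram()" statement for each variable) and print them on the same page? Would it be better to use the "ggplot2" library for this?

I can manually write a "histogram()" statement for each individual categorical variable in the dataset, but I am looking for a faster way to do this. Can it be done with a "for loop"?

Thanks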

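One ggplot2/tidyverse solution is to pivot the columns into long format and then use faceting to draw them all on the same page:

(Edited to plot only the factor variables)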

factor_vars <- sapply(data, is.factor)

varnames <- names(data)

deselect_not_factors <- varnames[!factor_vars]

library(tidyr)
library(ggplot2)

data_long <- data %>%
  pivot_longer(
    cols = -all_of(deselect_not_factors),  # all_of() avoids the ambiguous external-vector warning
    names_to = "category",
    values_to = "value"
  )

ggplot(data_long) +
  geom_bar(
    aes(x = value)
  ) +
  facet_wrap(~category, scales = "free")
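
A slightly shorter variant of the same idea, offered as a sketch rather than a replacement: the where() selection helper can pick the factor columns directly inside pivot_longer(), so the intermediate name vectors are not needed. The toy data frame `df` and its column names are made up for illustration, and this assumes a reasonably recent tidyr (1.2 or later) alongside ggplot2.

```r
library(tidyr)
library(ggplot2)

# Hypothetical toy data standing in for the question's data frame
set.seed(1)
df <- data.frame(
  num   = rnorm(20),
  food  = factor(sample(c("pizza", "sushi"), 20, replace = TRUE)),
  reply = factor(sample(c("a", "b"), 20, replace = TRUE))
)

# Select only the factor columns and stack them into name/value pairs
df_long <- pivot_longer(
  df,
  cols = where(is.factor),
  names_to = "category",
  values_to = "value"
)

# One bar chart per original column, each panel with its own axes
p <- ggplot(df_long) +
  geom_bar(aes(x = value)) +
  facet_wrap(~category, scales = "free")
print(p)
```

Because the selection is done by type, adding more factor columns to the data frame should add more panels without further changes.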

Here is an attempt using cowplot & ggplot2:
library(ggplot2)
library(dplyr)
library(foreach)
library(cowplot)

list_variables <- c("response", "favorite_food")
all_plot <- foreach(current_var = c(list_variables)) %do% {
  # need to do this to avoid ggplot reference to same summary data afterward.
  data_summary_name <- paste0(current_var, "_summary")
  eval(substitute(
    {
      graph_data <- data %>%
        group_by(!!sym(current_var)) %>%
        summarize(count = n(), .groups = "drop") %>%
        mutate(share = count / sum(count))
      plot <- ggplot(graph_data) +
        geom_bar(mapping = aes(x = !!sym(current_var), y = share), width = 1,
          fill = "#00FFFF", color = "#000000", stat = "identity") +
        scale_y_continuous(labels = scales::percent) +
        ggtitle(current_var) + ylab("Percent of Total") +
        theme_bw()
    }, list(graph_data = as.name(data_summary_name))
  )) 
  return(plot)
}

plot_grid(plotlist = all_plot, ncol = 2)

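Note: eval & substitute are used here so that each plot keeps a reference to its own summary data rather than to the last one computed; a related question covers the reasoning in more detail.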
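
If the eval/substitute step feels heavy, one possible alternative (a sketch, not the answerer's method) is to build each plot inside a function called by lapply(). Each call then has its own local summary data frame, which should sidestep the shared-reference concern. The toy data frame `df` and the local name `shares` are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for the question's data frame
set.seed(1)
df <- data.frame(
  food  = factor(sample(c("pizza", "sushi", "carrots"), 50, replace = TRUE)),
  reply = factor(sample(c("a", "b"), 50, replace = TRUE))
)

factor_cols <- names(df)[sapply(df, is.factor)]

# Build one share-of-total bar chart per factor column
plots <- lapply(factor_cols, function(v) {
  # Share of each level; 'shares' is local to this function call
  shares <- as.data.frame(prop.table(table(df[[v]])))
  names(shares) <- c("level", "share")
  ggplot(shares) +
    geom_col(aes(x = level, y = share), fill = "#00FFFF", color = "#000000") +
    scale_y_continuous(labels = scales::percent) +
    ggtitle(v) + ylab("Percent of Total") +
    theme_bw()
})
```

The resulting list can be handed to plot_grid(plotlist = plots, ncol = 2) in the same way as all_plot.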

Using facet_wrap, similar to QuishSwash's approach, but with the data computed as shares:

list_variables <- c("response", "favorite_food")
# Calculate share for chosen variables defined in list_variables
# You can adjust by having some variables selection based on some condition
summary_df <- bind_rows(foreach(current_var = c(list_variables)) %do% {
  data %>%
    group_by(variable = !!sym(current_var)) %>%
    summarize(count = n(), .groups = "drop") %>%
    mutate(share = count / sum(count),
      variable_name = current_var)
})

ggplot(summary_df) +
  geom_bar(
    aes(x = variable, y = share),
    fill = "#00FFFF", color = "#000000", stat = "identity") +
  facet_wrap(~variable_name, scales = "free") +
  scale_y_continuous(labels = scales::percent) +
  theme_bw()

Created on 2021-04-29 by the reprex package (v2.0.0)

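As an alternative, you can make use of the excellent DataExplorer package.

Note that histograms are meant for continuous variables, so for categorical variables you want bar charts instead. This can be done as follows: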

# Install DataExplorer if it is missing, then load it
if (!requireNamespace("DataExplorer", quietly = TRUE)) install.packages("DataExplorer")
library(DataExplorer)
DataExplorer::plot_histogram(data) # plots histograms for continuous variables
DataExplorer::plot_bar(data) # bar plots for categorical variables

See the package manual for details.

Here is a base R alternative using barplot in a for loop:

cols <- names(data)[sapply(data, is.factor)]


# This would need some manual adjustment if the number of columns increases
par(mfrow = c(1,length(cols))) 

for(i in cols) {
  barplot(table(data[[i]]), main = i)
}
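
To reduce the manual adjustment mentioned in the comment above, one option is to let grDevices::n2mfrow() suggest a rows-by-columns layout from the number of plots. The sketch below wraps the loop in a helper; `plot_factor_bars()` and the toy data frame `df` are illustrative names, not part of any package or of the original answer.

```r
# Hypothetical toy data standing in for the question's data frame
set.seed(1)
df <- data.frame(
  num   = rnorm(30),
  food  = factor(sample(c("pizza", "sushi"), 30, replace = TRUE)),
  reply = factor(sample(c("a", "b"), 30, replace = TRUE)),
  size  = factor(sample(c("S", "M", "L"), 30, replace = TRUE))
)

# Illustrative helper: draws one barplot per factor column of d
plot_factor_bars <- function(d) {
  cols <- names(d)[sapply(d, is.factor)]
  if (length(cols) == 0) return(invisible(cols))
  # n2mfrow() proposes a grid layout for the given number of plots
  old_par <- par(mfrow = grDevices::n2mfrow(length(cols)))
  on.exit(par(old_par))  # restore the previous layout on exit
  for (i in cols) {
    barplot(table(d[[i]]), main = i)
  }
  invisible(cols)  # names of the columns that were plotted
}

plotted <- plot_factor_bars(df)
```

With many factor columns the panels will get small, so the plotting device size may still need attention.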