Adjusting spark graphs/histograms in skimr package using R

Question

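I am writing a report that will show the results of some Likert-scale data. I would like to use the skim() function from the skimr package to take advantage of its spark graph/histogram visuals. The problem is that the response options for each of my questions range from 1 to 5, but some of my questions only collected responses in the range 3 to 5 (response options 1 and 2 were never selected). The histogram shows five bars, but their range appears to represent 3, 3.5, 4, 4.5, 5 rather than 1 through 5. How can I make skimr show options 1 through 5? Thanks in advance for your help.

Example:

Data: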

Var1 Var2   Var3    Var4    Var5    Var6    Var7  Var8
1     3      3       3      1        3       4       4
5     5      5       4      2        5       5       5
5     5      5       5      5        5       5       5
5     5      5       4      2        5       5       5
5     5      5       4      2        5       5       5

I use the following code:

skim(Data)

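I would like the histogram (the "hist" column) to show responses 1 through 5. But for variables 2, 3, 4, 6, 7, and 8 it only shows values from 3 or 4 up to 5. Is there any way to adjust this?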

Answer 1

You seem to have misunderstood something slightly.
Let's take your unchanged data as a tibble and pass it to the skim function.

library(tidyverse)
library(skimr)

df = read.table(
  header = TRUE,text="
Var1 Var2   Var3    Var4    Var5    Var6    Var7  Var8
1     3      3       3      1        3       4       4
5     5      5       4      2        5       5       5
5     5      5       5      5        5       5       5
5     5      5       4      2        5       5       5
5     5      5       4      2        5       5       5
") %>% as_tibble() 


df %>% skim()

This is what we get in the output:

-- Data Summary ------------------------
                           Values    
Name                       Piped data
Number of rows             5         
Number of columns          8         
_______________________              
Column type frequency:               
  numeric                  8         
________________________             
Group variables            None      

-- Variable type: numeric ---------------------------------------------------------------------------------------------
# A tibble: 8 x 11
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
* <chr>             <int>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Var1                  0             1   4.2 1.79      1     5     5     5     5 ▂▁▁▁▇
2 Var2                  0             1   4.6 0.894     3     5     5     5     5 ▂▁▁▁▇
3 Var3                  0             1   4.6 0.894     3     5     5     5     5 ▂▁▁▁▇
4 Var4                  0             1   4   0.707     3     4     4     4     5 ▂▁▇▁▂
5 Var5                  0             1   2.4 1.52      1     2     2     2     5 ▂▇▁▁▂
6 Var6                  0             1   4.6 0.894     3     5     5     5     5 ▂▁▁▁▇
7 Var7                  0             1   4.8 0.447     4     5     5     5     5 ▂▁▁▁▇
8 Var8                  0             1   4.8 0.447     4     5     5     5     5 ▂▁▁▁▇

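However, you did write that your data are on a Likert scale. For this kind of data, computing the mean, standard deviation, and so on makes no sense: what does it mean that the average of Var1 is 4.2? I cannot interpret it.
So we need to convert all the variables to factors, with levels 1 through 5 declared explicitly so that options nobody selected are kept.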

# Declare all five response options as levels, even the ones never selected
df %>% mutate_all(~factor(., levels = 1:5)) %>% skim()

Output

-- Data Summary ------------------------
                           Values    
Name                       Piped data
Number of rows             5         
Number of columns          8         
_______________________              
Column type frequency:               
  factor                   8         
________________________             
Group variables            None      

-- Variable type: factor ----------------------------------------------------------------------------------------------
# A tibble: 8 x 6
  skim_variable n_missing complete_rate ordered n_unique top_counts            
* <chr>             <int>         <dbl> <lgl>      <int> <chr>                 
1 Var1                  0             1 FALSE          2 5: 4, 1: 1, 2: 0, 3: 0
2 Var2                  0             1 FALSE          2 5: 4, 3: 1, 1: 0, 2: 0
3 Var3                  0             1 FALSE          2 5: 4, 3: 1, 1: 0, 2: 0
4 Var4                  0             1 FALSE          3 4: 3, 3: 1, 5: 1, 1: 0
5 Var5                  0             1 FALSE          3 2: 3, 1: 1, 5: 1, 3: 0
6 Var6                  0             1 FALSE          2 5: 4, 3: 1, 1: 0, 2: 0
7 Var7                  0             1 FALSE          2 5: 4, 4: 1, 1: 0, 2: 0
8 Var8                  0             1 FALSE          2 5: 4, 4: 1, 1: 0, 2: 0

Now this makes more sense. For Var1 we can see four answers of 5, one answer of 1, and zero for the remaining options, whatever answer 5 actually stands for.
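Note that the top_counts column in the output above lists only four levels per variable, so one of the five options is always hidden. As a minimal sketch of getting the complete counts in base R, the snippet below tabulates every option for a few of the columns. The names `responses` and `count_levels` are made up for this illustration, and only four of the eight variables are typed in to keep it short.

```r
# A subset of the question's data, entered by hand so the sketch is self-contained.
responses <- data.frame(
  Var1 = c(1, 5, 5, 5, 5),
  Var2 = c(3, 5, 5, 5, 5),
  Var4 = c(3, 4, 5, 4, 4),
  Var5 = c(1, 2, 5, 2, 2)
)

# Hypothetical helper: count answers per option, keeping options with zero answers
# by declaring the full set of levels before tabulating.
count_levels <- function(x, levels = 1:5) {
  table(factor(x, levels = levels))
}

# One column per variable, one row per response option (rows named "1" to "5").
counts <- sapply(responses, count_levels)
counts
```

Each column of `counts` then has exactly five entries, including the zeros for options that were never chosen.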
However, now there are no histograms. Well, we can easily produce them ourselves.

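# Reshape to long format and draw one bar chart of counts per variable
df %>% mutate_all(~factor(., levels = 1:5)) %>% 
  pivot_longer(everything()) %>% 
  ggplot(aes(value))+
  geom_bar()+  # counts per level; same result as geom_histogram(stat="count"), without the warning
  facet_grid(rows=vars(name))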
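If the inline spark histogram in the skim() table is still wanted, one possible approach is to compute it yourself with one bar per response option. The sketch below is not skimr's own method: `likert_spark` is a hypothetical helper written for this illustration, and it relies on the assumption that a skimmer named `hist` passed through skim_with() and sfl() replaces the default numeric `hist` column.

```r
# Hypothetical helper (not part of skimr): one bar per Likert option,
# always spanning the full scale given in `levels`.
likert_spark <- function(x, levels = 1:5) {
  # Count responses per option; options nobody chose get a count of 0.
  counts <- tabulate(match(x, levels), nbins = length(levels))
  # The eight Unicode block characters U+2581..U+2588, shortest to tallest.
  bars <- intToUtf8(0x2581:0x2588, multiple = TRUE)
  if (max(counts) == 0) {
    return(strrep(bars[1], length(levels)))
  }
  # Scale each count to a bar height relative to the most frequent option.
  idx <- 1 + round(counts / max(counts) * (length(bars) - 1))
  paste(bars[idx], collapse = "")
}

# Var2 from the question: one answer of 3 and four answers of 5.
likert_spark(c(3, 5, 5, 5, 5))

# Assumption: supplying a skimmer called `hist` overrides the default one.
if (requireNamespace("skimr", quietly = TRUE)) {
  likert_skim <- skimr::skim_with(numeric = skimr::sfl(hist = likert_spark))
}
```

With skimr installed, calling `likert_skim(df)` on the original numeric data should then show a five-bar spark line whose positions stand for options 1 through 5, with the lowest bar for options that were never selected.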

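One last small tip. When working with data, give things meaningful names, and label the values according to your scale. So I renamed your variables to questions and changed the answer values to the following levels: "definitely yes, yes, I don't know, no, definitely not".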

df = read.table(
  header = TRUE,text="
Question1 Question2   Question3    Question4    Question5    Question6    Question7  Question8
def.not     don't.know      don't.know       don't.know      def.not        don't.know          yes          yes
def.yes     def.yes      def.yes          yes           not        def.yes       def.yes       def.yes
def.yes     def.yes      def.yes       def.yes      def.yes        def.yes       def.yes       def.yes
def.yes     def.yes      def.yes          yes           not        def.yes       def.yes       def.yes
def.yes     def.yes      def.yes          yes           not        def.yes       def.yes       def.yes
") %>% as_tibble() %>% mutate_all(~factor(., c("def.not", "not", "don't.know", "yes", "def.yes")))

Output

# A tibble: 5 x 8
  Question1 Question2  Question3  Question4  Question5 Question6  Question7 Question8
  <fct>     <fct>      <fct>      <fct>      <fct>     <fct>      <fct>     <fct>    
1 def.not   don't.know don't.know don't.know def.not   don't.know yes       yes      
2 def.yes   def.yes    def.yes    yes        not       def.yes    def.yes   def.yes  
3 def.yes   def.yes    def.yes    def.yes    def.yes   def.yes    def.yes   def.yes  
4 def.yes   def.yes    def.yes    yes        not       def.yes    def.yes   def.yes  
5 def.yes   def.yes    def.yes    yes        not       def.yes    def.yes   def.yes

Now your histograms will be clearer, don't you think?

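# The columns are already factors with labelled levels, so just reshape and plot
df %>% pivot_longer(everything()) %>% 
  ggplot(aes(value))+
  geom_bar()+  # counts per level; same result as geom_histogram(stat="count"), without the warning
  facet_grid(rows=vars(name))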

