Variables that affect a histogram plotted with the hist() function in R

In R you can plot a histogram and save its properties to a variable:

> h1=hist(c(1,1,2,3,4,5,5), breaks=0.5:5.5)

These properties can then be read:

> h1
$breaks
[1] 0.5 1.5 2.5 3.5 4.5 5.5

$counts
[1] 2 1 1 1 2

$density
[1] 0.2857143 0.1428571 0.1428571 0.1428571 0.2857143

$mids
[1] 1 2 3 4 5

$xname
[1] "c(1, 1, 2, 3, 4, 5, 5)"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

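How do these properties affect the histogram? So far I have figured out the following:

The relationship between $breaks and $counts. $breaks gives the intervals that the plotted data can fall into, and $counts gives the number of data points that fall into each interval, for example:

[] denotes a closed interval (endpoints included)

() denotes an open interval (endpoints excluded)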

BREAKS  : COUNTS
[0.5-1.5] : 2 # Two 1s fall into this interval
(1.5-2.5] : 1 # One 2 falls into this interval
(2.5-3.5] : 1 # One 3 falls into this interval
(3.5-4.5] : 1 # One 4 falls into this interval
(4.5-5.5] : 2 # Two 5s fall into this interval
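
The interval table above can be cross-checked with a minimal sketch that uses cut() and table() from base R. The variable names (x, breaks, bins, counts_by_cut) are illustrative, and plot = FALSE is used only to skip drawing. Note that hist() applies a small internal fuzz to its breaks, so agreement with cut() is expected for this data but is not guaranteed for values sitting extremely close to a break.

```r
x <- c(1, 1, 2, 3, 4, 5, 5)
breaks <- 0.5:5.5

# right = TRUE with include.lowest = TRUE gives bins (a, b],
# except the first bin, which is closed on both sides: [0.5, 1.5]
bins <- cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE)
counts_by_cut <- as.vector(table(bins))

# Same binning through hist(), without drawing the plot
h <- hist(x, breaks = breaks, plot = FALSE)

levels(bins)      # interval labels for each bin
counts_by_cut     # counts per bin from cut()/table()
h$counts          # counts per bin from hist()
```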

The relationship between $breaks and $density is basically the same as above, only written as percentages, for example:

BREAKS  : DENSITY
[0.5-1.5] : 0.2857143 # This interval covers about 28% of the plot
(1.5-2.5] : 0.1428571 # This interval covers about 14% of the plot
(2.5-3.5] : 0.1428571 # This interval covers about 14% of the plot
(3.5-4.5] : 0.1428571 # This interval covers about 14% of the plot
(4.5-5.5] : 0.2857143 # This interval covers about 28% of the plot

Of course, when you add up all these values you get 1:

> sum(h1$density)
[1] 1

The following is the x-axis name:

$xname
[1] "c(1, 1, 2, 3, 4, 5, 5)"

But what do the remaining ones do, especially $mids?

$mids
[1] 1 2 3 4 5

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

Also, help(hist) lists many other return values. Shouldn't they be listed in the output above as well, and if not, why? As the following article states:

By default, bin counts include values less than or equal to the bin's right break point and strictly greater than the bin's left break point, except for the leftmost bin, which includes its left break point.

So the following:

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5)

will return a histogram in which 1.5 falls into the 0.5-1.5 interval. One "workaround" is to make the interval size smaller, e.g.

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=seq(0.5,5.5,0.1))

But this seems impractical to me, and it also adds a bunch of 0s to $counts and $density. Is there a better, automatic way?

Besides that, it has a side effect that I cannot explain: why do the densities in the last example sum to 10 instead of 1?

> sum(h1$density)
[1] 10
> h1$density[h1$density>0]
[1] 2.50 1.25 1.25 1.25 1.25 2.50

Q1: What $mids and $equidist mean. From the help file:

mids: the n cell midpoints.

equidist: logical, indicating if the distances between breaks are all the same.
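
As a rough illustration of these two definitions, both values can be recomputed from $breaks alone. This is a sketch, not the internals of hist(): the names b, manual_mids and manual_equidist are illustrative, and the tolerance used in the equal-width check is an assumption chosen to absorb floating-point noise.

```r
x <- c(1, 1, 2, 3, 4, 5, 5)
h <- hist(x, breaks = 0.5:5.5, plot = FALSE)
b <- h$breaks

# Midpoint of each cell: average of its left and right break
manual_mids <- (head(b, -1) + tail(b, -1)) / 2

# TRUE when all bin widths agree (tolerance value is an assumption)
manual_equidist <- diff(range(diff(b))) < 1e-7 * mean(diff(b))

# Unequal bin widths for comparison
h_uneven <- hist(x, breaks = c(0.5, 2.5, 3.5, 5.5), plot = FALSE)

manual_mids        # compare with h$mids
manual_equidist    # compare with h$equidist
h_uneven$mids      # midpoints of the three uneven cells
h_uneven$equidist  # FALSE, since the widths are 2, 1 and 2
```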


Q2: Yes, with h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5) the value 1.5 will fall into the 0.5-1.5 category. If you want it to fall into the 1.5-2.5 category, you should use:

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.49:5.49)

or, more neatly:

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5, right=FALSE)

I'm not sure what you want to automate here, but hopefully the above answers your question. If not, please state your question more clearly.
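
To make the effect of right=FALSE concrete, here is a small side-by-side sketch. The names h_right and h_left are illustrative, and plot = FALSE only suppresses drawing.

```r
x <- c(1, 1, 2, 3, 4, 5, 5, 1.5)

# Default (right = TRUE): bins are (a, b], so 1.5 lands in the first bin
h_right <- hist(x, breaks = 0.5:5.5, plot = FALSE)

# right = FALSE: bins are [a, b), so 1.5 lands in the second bin
h_left <- hist(x, breaks = 0.5:5.5, right = FALSE, plot = FALSE)

h_right$counts  # 3 1 1 1 2
h_left$counts   # 2 2 1 1 2
```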


Q3: As for the density summing to 10 rather than 1, that is because density is not frequency. From the help file:

density: values f^(x[i]), as estimated density values. If all(diff(breaks) == 1), they are the relative frequencies counts/n and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = breaks[i].

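So if your bin widths are not equal to 1, the densities will not sum to 1. What sums to 1 is each density multiplied by the width of its bin.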
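
The relation quoted from the help file can be checked on the example from the question. This is a minimal sketch; the names total_density and total_area are illustrative.

```r
x <- c(1, 1, 2, 3, 4, 5, 5, 1.5)
h <- hist(x, breaks = seq(0.5, 5.5, 0.1), plot = FALSE)

# Summing the raw densities ignores the bin width of 0.1
total_density <- sum(h$density)

# Weighting each density by its bin width gives the area under the histogram
total_area <- sum(h$density * diff(h$breaks))

total_density  # about 10, i.e. 1 / 0.1
total_area     # about 1
```

With a bin width of 0.1, each density is counts / (n * 0.1), which is why the unweighted sum comes out ten times larger than the total area.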