How to calculate a "midpoint date" by group in Stata?

Question

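I have a dataset with two variables: journalistName and articleDate.

For each journalist (group), I want to create a variable that classifies articles chronologically as 1 for the "first half" and 2 for the "second half".

For example, if a journalist wrote 4 articles, I want the first two articles classified as 1.

If he wrote 5 articles, I want the first three articles classified as 1.

One possibility I thought of is to compute the midpoint date and then use an if condition (gen cat1 = 1 if midpoint > startdate), but I don't know how to generate such a midpoint in Stata.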

Answer 1

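Based on your description of which articles are classified as 1, you are looking for the midpoint of the number of articles, not of the date range.

One solution is to use by-group processing with _n and _N: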

gen cat = 2
bysort author (date): replace cat = 1 if _n <= ceil(_N/2)

This sorts by author and date, then assigns cat = 1 to the observations in each author group whose observation number (_n) is less than or equal to the median observation number (ceil(_N/2)).
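As a quick check of the cutoff, the two cases from the question can be evaluated directly. This is only a small illustration of how ceil() treats even and odd group sizes:

```stata
* ceil() rounds up, so an odd-sized group puts the extra article in the first half
display ceil(4/2)   // 2: first two of four articles get cat = 1
display ceil(5/2)   // 3: first three of five articles get cat = 1
```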

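Note that you need a numeric (not string) date for the sort to work correctly. Also, in my view, cat = {1,2} is less intuitive than something like firsthalf = {0,1}. Either way, labelling the values (help label) helps clarity.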
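A minimal sketch of the 0/1 alternative with value labels, assuming the variables author and a numeric date already exist. The names firsthalf and halflbl are hypothetical, chosen only for illustration:

```stata
* Indicator: 1 for the first half of each author's articles, 0 otherwise
* (firsthalf and halflbl are illustrative names, not from the original answer)
bysort author (date): gen byte firsthalf = _n <= ceil(_N/2)

* Attach value labels so listings and tables are self-explanatory
label define halflbl 0 "second half" 1 "first half"
label values firsthalf halflbl
```

A 0/1 indicator can be used directly in expressions such as if firsthalf, which is part of why it tends to read more naturally than a 1/2 coding.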

For details, see help by and this article.

Finally, the method in action:

clear all
input str10 author str10 datestr
"Alex" "09may2015"
"Alex" "06apr2015"
"Alex" "15jul2014"
"Alex" "19aug2013"
"Alex" "03mar2009"
"Betty" "09may2015"
"Betty" "06apr2015"
"Betty" "15jul2014"
"Betty" "19aug2013"
end

gen date = daily(datestr, "DMY")
format date %td

gen cat = 2
bysort author (date): replace cat = 1 if _n <= ceil(_N/2)

list , sepby(author) noobs

Result

  +--------------------------------------+
  | author     datestr        date   cat |
  |--------------------------------------|
  |   Alex   03mar2009   03mar2009     1 |
  |   Alex   19aug2013   19aug2013     1 |
  |   Alex   15jul2014   15jul2014     1 |
  |   Alex   06apr2015   06apr2015     2 |
  |   Alex   09may2015   09may2015     2 |
  |--------------------------------------|
  |  Betty   19aug2013   19aug2013     1 |
  |  Betty   15jul2014   15jul2014     1 |
  |  Betty   06apr2015   06apr2015     2 |
  |  Betty   09may2015   09may2015     2 |
  +--------------------------------------+

If you really do want to work with the midpoint date, you can use the same general principle:

bysort author (date): gen beforemiddate = date <= ceil((date[_N] + date[1]) / 2)
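If it helps to inspect the midpoint itself rather than only the indicator, the same expression can be stored as its own variable. This is a sketch under the same assumptions (author and a numeric daily date); middate is a hypothetical name:

```stata
* Midpoint between each author's earliest and latest article date
* (middate is an illustrative name; date[1] and date[_N] rely on the sort by date)
bysort author (date): gen middate = ceil((date[1] + date[_N]) / 2)
format middate %td

list author date middate, sepby(author) noobs
```

Note that this splits the date range in half, not the number of articles, so the two approaches can classify the same article differently.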

Also, to find the last date in the "pre-midpoint" period, the same principle applies:

bysort author cat (date): gen lastdate = date[_N] if cat == 1
by author: replace lastdate = lastdate[_n-1] if missing(lastdate)
format lastdate %td

Alternatively, an egen function containing a logical test can do the job more quickly:

egen lastdate = max(date * (cat == 1)) , by(author)
format lastdate %td
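One caveat with multiplying by the logical test: observations with cat != 1 contribute 0 rather than missing, so the maximum would be wrong if the first-half dates were negative, that is, before 01jan1960 in Stata's daily date encoding. A sketch of a variant that sidesteps this with cond(), using the hypothetical name lastdate2 to avoid clashing with lastdate:

```stata
* Observations outside the first half become missing, which egen's max() ignores
* (lastdate2 is an illustrative name)
egen lastdate2 = max(cond(cat == 1, date, .)), by(author)
format lastdate2 %td
```

For the example data above, where all dates fall after 1960, both versions should agree.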
