Counting occurrences of values of string variable using collapse

Question

In a dataset such as the following,

clear
input patid str2 dx 
1   qw
1   qe
1   qw
2   qw
2   qw
2   qs
2   qs
3   qe
3   qe
3   qs
3   qw
3   qw
3   qw
3   qs
4   qe
5   qa
5   qs
5   qw
5   qe
5   qw
end

I find that I can count the number of occurrences of each value of the string variable dx using subscripts [1], or using collapse if I convert dx to numeric codes [2].

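When using collapse, is there a command or syntax that would let me count occurrences directly from the string variable itself (without conversion, etc.)?

For example, if I try collapse (count) countdx=dx, by(patid dx), this returns the error message variable dx not found.

(Of course, this shouldn't work either: when I try collapse (count) countdx=dx, by(patid), this returns the error type mismatch.)

Notes: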

[1]

by patid dx, sort: egen ndx = count(dx)
by patid dx: g orderdx=_n
by patid dx: drop if orderdx>1

[2]

g numdx=.
replace numdx=1 if dx=="qa"
replace numdx=2 if dx=="qe"
replace numdx=3 if dx=="qs"
replace numdx=4 if dx=="qw"
collapse (count)  countdx=numdx, by(patid dx)

Answer 1

Your examples, though not your question, both imply that you want to count separately for each distinct value of the identifier patid.

clear
input patid str2 dx 
1   qw
1   qe
1   qw
2   qw
2   qw
2   qs
2   qs
3   qe
3   qe
3   qs
3   qw
3   qw
3   qw
3   qs
4   qe
5   qa
5   qs
5   qw
5   qe
5   qw
end

bysort patid dx : gen count = _N 

tabdisp patid dx , c(count) 

----------------------------------
          |           dx          
    patid |   qa    qe    qs    qw
----------+-----------------------
        1 |          1           2
        2 |                2     2
        3 |          2     2     3
        4 |          1            
        5 |    1     1     1     2
----------------------------------

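For a review of technique in this territory, see this paper. Searching Statalist for mentions of dm0042 will turn up many related examples.

tabdisp is not especially practical for even moderately sized problems. It is mentioned here only to show directly what the previous command does.

Extending this to collapse, one simple device is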

gen one = 1

collapse (sum) one, by(patid dx)

although I should mention that contract was written for this purpose (see Cox 1998 for discussion of its predecessor).
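As a minimal sketch of the contract route, assuming a cut-down version of the question's data (patid 1 and 2 only) and a variable name countdx chosen here for illustration, the frequencies can be obtained directly from the string variable with no helper variable:

```stata
* Sketch only: a subset of the question's data, for illustration
clear
input patid str2 dx
1   qw
1   qe
1   qw
2   qw
2   qw
2   qs
2   qs
end

* contract replaces the data with one observation per distinct
* combination of patid and dx; freq() names the frequency variable
* (countdx is a name assumed here; the default name is _freq)
contract patid dx, freq(countdx)

list, sepby(patid)
```

Like collapse, contract replaces the dataset in memory, so wrap it in preserve and restore if the original observations are still needed.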

On the other hand, if you do create the count variable, then

collapse (mean) count, by(patid dx)

would have exactly the same effect.
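One way to convince yourself that the two collapse calls agree is to compute both statistics in a single collapse and compare them. This is a sketch on a cut-down version of the question's data; combining the two statistics in one call is done here only for the comparison and is not part of the approach above.

```stata
* Sketch only: a subset of the question's data, for illustration
clear
input patid str2 dx
1   qw
1   qe
1   qw
2   qw
2   qw
2   qs
2   qs
end

* group size, repeated on every observation in the group
bysort patid dx : gen count = _N

* constant to be summed within groups
gen one = 1

* the sum of one and the mean of count should both equal the group size
collapse (sum) one (mean) count, by(patid dx)

* stops with an error if the two ever differ
assert one == count

list, sepby(patid)
```

The mean works because count is constant within each patid dx group, so averaging it simply returns that constant.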

Cox, N.J. 1998. Collapsing datasets to frequencies. Stata Technical Bulletin 44: 2-3.
