How to Simply Create a New Variable Based on Ranges of Another
Suppose I have var1, which is continuous:
clear
set obs 1000
gen var1 = runiform()
sum var1
Now I want to create var2 based on ranges of var1. I can do it like this:
gen var2 = "Lowest" if var1<.25
replace var2 = "Low" if var1>=.25 & var1<.5
replace var2 = "High" if var1>=.5 & var1<.75
replace var2 = "Highest" if var1>=.75
I would like to be able to do this in one line. Pseudocode:
gen var2 = (ranges(0 .25 .5 .75 1) values("Lowest" "Low" "High" "Highest"))
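A very similar approach in R, using cut, can be found in "Create categorical variable in R based on range".
Is there a Stata command similar to the R version? Imagine that 10,000 ranges need to go into var2. A better approach would then help a great deal.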
Another way to do this in one line in Stata is clumsy. It can be found at http://www.stata.com/support/faqs/data-management/multiple-operations/:
generate var2 = cond(var1<=.25, "Lowest", cond(var1<=.50, "Low", cond(var1<=.75, "High", cond(var1<=1.00, "Highest", ""))))
Is there a better way?
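Stata does have a cut() function, as part of the egen command. Using its options, and defining and assigning a value label, gets you the result you want (in three lines rather than one, but they are three fairly concise lines). For example,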
clear
set obs 15
gen var1 = runiform()
sum var1
gen var2 = "Lowest" if var1<.25
replace var2 = "Low" if var1>=.25 & var1<.5
replace var2 = "High" if var1>=.5 & var1<.75
replace var2 = "Highest" if var1>=.75
// =======================================================
// Using egen , cut()
// =======================================================
label define rank 0 "Lowest" 1 "Low" 2 "High" 3 "Highest"
egen var3 = cut(var1) , at(0(.25)1) icodes
label values var3 rank
li
Result:
+------------------------------+
| var1 var2 var3 |
|------------------------------|
1. | .6658295 High High |
2. | .3690664 Low Low |
3. | .5983131 High High |
4. | .2658775 Low Low |
5. | .1211114 Lowest Lowest |
|------------------------------|
6. | .2296222 Lowest Lowest |
7. | .7229139 High High |
8. | .2501513 Low Low |
9. | .7775574 Highest Highest |
10. | .2839603 Low Low |
|------------------------------|
11. | .8396428 Highest Highest |
12. | .4838379 Low Low |
13. | .2610629 Low Low |
14. | .3855471 Low Low |
15. | .3447088 Low Low |
+------------------------------+
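Note that var3 above is a numeric variable with a value label attached, whereas var2 is a string. If a string is really what is wanted, one possible extra step is decode. The following is only a sketch under that assumption; the name var3_str is made up for illustration, and the seed is set only to make the run reproducible.

```stata
clear
set obs 15
set seed 2803
gen var1 = runiform()

* Same three lines as above: bin var1 into codes 0..3 and label them
label define rank 0 "Lowest" 1 "Low" 2 "High" 3 "Highest"
egen var3 = cut(var1), at(0(.25)1) icodes
label values var3 rank

* Turn the labelled numeric codes into a string variable
* (var3_str is a hypothetical name)
decode var3, generate(var3_str)
list
```

For most purposes the labelled numeric variable is the more convenient one to keep, since it sorts in the intended order rather than alphabetically.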
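The cond() function is the allegedly clumsy one. See var3 below for an example. It has the distinct advantages that you can make the inequalities explicit in your code and that you get exactly what you want, neither of which is true of egen, cut().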
In this particular example, at least one other trick is possible. See var4 below for what it is.
. clear
. set obs 15
number of observations (_N) was 0, now 15
. set seed 2803
. gen var1 = runiform()
. sort var1
. gen var2 = "Lowest" if var1<.25
(9 missing values generated)
. replace var2 = "Low" if var1>=.25 & var1<.5
(4 real changes made)
. replace var2 = "High" if var1>=.5 & var1<.75
(2 real changes made)
. replace var2 = "Highest" if var1>=.75
variable var2 was str6 now str7
(3 real changes made)
. gen var3 = cond(var1 < .25, "Lowest", cond(var1 <.5, "Low", cond(var1 <.75, "
> High", "Highest")))
. gen var4 = word("Lowest Low High Highest", ceil(4 * var1))
. list
+----------------------------------------+
| var1 var2 var3 var4 |
|----------------------------------------|
1. | .0200225 Lowest Lowest Lowest |
2. | .0360774 Lowest Lowest Lowest |
3. | .0934085 Lowest Lowest Lowest |
4. | .0950848 Lowest Lowest Lowest |
5. | .1040797 Lowest Lowest Lowest |
|----------------------------------------|
6. | .1795591 Lowest Lowest Lowest |
7. | .3326341 Low Low Low |
8. | .3383934 Low Low Low |
9. | .3870576 Low Low Low |
10. | .3980427 Low Low Low |
|----------------------------------------|
11. | .6264514 High High High |
12. | .6305373 High High High |
13. | .7739685 Highest Highest Highest |
14. | .7935746 Highest Highest Highest |
15. | .9243789 Highest Highest Highest |
+----------------------------------------+
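One detail of the var4 trick is worth spelling out: ceil(4 * var1) sends a value of exactly .25 to word 1 ("Lowest"), whereas the generate/replace code for var2 sends it to "Low". With random uniform data that tie is very unlikely, but if the boundaries matter, a variant with floor() matches the var2 convention. This is only a sketch; var4b is a made-up name, and it assumes var1 is at least 0 and strictly below 1 (a value of exactly 1 would ask for a fifth word that does not exist).

```stata
clear
set obs 15
set seed 2803
gen var1 = runiform()

* floor() + 1 puts a value sitting exactly on a cut point into the upper bin,
* matching var1 >= .25, var1 >= .5, var1 >= .75 in the replace statements
gen var4b = word("Lowest Low High Highest", floor(4 * var1) + 1)
list
```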
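However, if you really have 10,000 ranges to specify, and they do not boil down to some simple rule, then naturally you would not do it in either of these ways. You should put them in a file and use some code based on merge.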
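The answer does not show what the file-based code would look like. As a rough sketch only: the version below keeps the ranges in a lookup dataset, but uses append plus a sort and carry-forward rather than merge itself, because merge needs an exact key and the data hold values rather than range identifiers. The lookup layout and the names lb, rlabel, is_range and id are all assumptions made for illustration, and the four ranges typed in with input stand in for a file that would hold the real ones.

```stata
* Hypothetical lookup table: one row per range, lower bound plus label
clear
input double lb str7 rlabel
0    "Lowest"
.25  "Low"
.5   "High"
.75  "Highest"
end
gen byte is_range = 1
tempfile ranges
save `ranges'

* Example data
clear
set obs 1000
set seed 2803
gen double var1 = runiform()
gen byte is_range = 0
gen long id = _n

* Stack the range rows under the data and give them a position on var1
append using `ranges'
replace var1 = lb if is_range

* Sort by value; at ties the range row comes first, so a value equal to a
* lower bound falls into that range
gsort var1 -is_range

* Range rows carry their own label; data rows inherit the label of the
* nearest range row above them (replace works through the rows in order)
gen var2 = rlabel
replace var2 = var2[_n-1] if !is_range

* Remove the scaffolding and restore the original order
drop if is_range
drop lb rlabel is_range
sort id
```

This assumes every data value is at or above the smallest lower bound; anything below it would be left with an empty var2.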