SAS SCAN Function and Missing Values
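I'm trying to develop a recursive program that imputes missing string values using flat probabilities (for example, if a variable has three possible values and one observation is missing, the missing observation has a 33% chance of being replaced by each of those values).
Note: the purpose of this post is not to discuss the merits of imputation techniques.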
DATA have;
INPUT id gender $ b $ c $ x;
CARDS;
1 M Y . 5
2 F N . 4
3 N Tall 4
4 M Short 2
5 F Y Tall 1
;
/* Counts number of categories i.e. 2 */
proc sql;
SELECT COUNT(Unique(gender)) into :rescats
FROM have
WHERE Gender ~= " " ;
Quit;
%let rescats = &rescats;
%put &rescats; /*internal check */
/* Collects response categories separated by commas i.e. F,M */
proc sql;
SELECT UNIQUE gender into :genders separated by ","
FROM have
WHERE Gender ~= " "
GROUP BY Gender;
QUIT;
%let genders = &genders;
%put &genders; /*internal check */
/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql;
SELECT COUNT (UNIQUE(customerKey)) into :ID
FROM have
WHERE customerkey < 6;
QUIT;
%let ID = &ID;
%put &ID; /*internal check */
data want;
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1;
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ',');
END;
RUN;
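My SCAN function does not create F or M observations in gender. It also seems to create new variables named M and F. In addition, the DO loop creates additional entries under CustomerKey. Is there any way to get rid of these?
I would prefer to solve this with loops and macros. I'm not proficient with arrays yet.
Here's my attempt at tidying this up: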
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=','; /*INFILE placed ahead of INPUT so the delimiter is set before any line is read*/
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
/*Consolidated into 1 proc, added noprint and removed unnecessary group by*/
proc sql noprint;
/* Counts number of categories i.e. 2 */
SELECT COUNT(unique(gender)) into :rescats
FROM have
WHERE not(missing(Gender));
/* Collects response categories separated by commas i.e. F,M */
SELECT unique gender into :genders separated by ","
FROM have
WHERE not(missing(Gender))
;
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders; /*internal check */
/*Removed ID list code as it wasn't making any difference to the imputation in this example*/
data want;
SET have;
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 or 2 */
RandGender = ROUND(u*(&rescats - 1)) + 1;
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ','); /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
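One thing worth checking if a variable has more than two categories: ROUND(u*(&rescats - 1)) + 1 gives the first and last category half the weight of the ones in between, so the draw is only flat when &rescats is 2. Below is a minimal sketch of a flat draw using CEIL. It assumes a made-up three-level list (F,M,X is for illustration only) and uses RAND with CALL STREAMINIT rather than RanUni.

```sas
/* Hypothetical values for illustration; in the code above they come from PROC SQL */
%let genders = F,M,X;
%let rescats = 3;

data _null_;
  call streaminit(12345);               /* fix the seed so the run is repeatable */
  do i = 1 to 10;
    u    = rand('uniform');             /* uniform draw between 0 and 1 */
    idx  = ceil(u * &rescats);          /* 1, 2 or 3, each with probability 1/3 */
    pick = scan("&genders", idx, ',');  /* quoted, so F,M,X is read as one string */
    put i= idx= pick=;
  end;
run;
```

The same CEIL expression can stand in for the RandGender line in the DATA step above.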
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=',';
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
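- Tip: You can use a period (.) to represent a missing value for a character variable during INPUT.
- Tip: DATALINES is the modern alternative to CARDS.
- Tip: Data values do not have to line up in columns, but it helps human readers.
So this also works: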
/*List input with the default blank delimiter; a period marks a missing value*/
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;
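- Tip: Your technique requires two passes over the data.
- One to determine the distinct values.
- A second to apply your imputation.
- Most approaches require two passes for each variable processed. A hash approach can get by with only two passes in total, but needs more memory.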
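For a single variable, the hash idea could look roughly like this. It is only a sketch: the hash object names (seen, byidx) and the helper variables (idx, n_levels, rc) are made up here, and it assumes the distinct values fit in memory. The first loop reads have once to collect the distinct non-missing values, and the main SET reads it a second time to fill the gaps.

```sas
data have;
  input id gender $ b $ c $ x;
  datalines;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;

data want;
  if _n_ = 1 then do;
    call streaminit(12345);
    declare hash seen();                /* distinct values found so far */
    seen.defineKey('gender');
    seen.defineDone();
    declare hash byidx();               /* position 1..n_levels -> value */
    byidx.defineKey('idx');
    byidx.defineData('gender');
    byidx.defineDone();
    /* Pass 1: collect the distinct non-missing values */
    do until (done);
      set have(keep=gender) end=done;
      if not missing(gender) and seen.check() ne 0 then do;
        seen.add();
        idx + 1;                        /* running count of distinct values */
        byidx.add();
      end;
    end;
    n_levels = idx;
  end;
  retain n_levels;
  /* Pass 2: replace missing values with a uniformly chosen level */
  set have;
  if missing(gender) then do;
    idx = ceil(n_levels * rand('uniform'));
    rc  = byidx.find();                 /* copies the stored value into gender */
  end;
  drop idx n_levels rc;
run;
```

Handling several variables this way would need one pair of hash objects per variable, which is where the extra memory goes.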
There are many ways to determine the distinct values: SORTING+FIRST., Proc FREQ, DATA Step HASH, SQL, and more.
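As one concrete case, the sorting route can be shortened with the NODUPKEY option of PROC SORT instead of an explicit FIRST. test. A small sketch; the output data set name gender_levels is made up:

```sas
data have;
  input id gender $ b $ c $ x;
  datalines;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;

/* Keep one row per non-missing gender value */
proc sort data=have(keep=gender where=(not missing(gender)))
          out=gender_levels nodupkey;
  by gender;
run;
```

The result stays in a data set, so nothing has to be pushed through a macro variable.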
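Tip: Solutions that move data into code and back into data are sometimes needed, but they can be troublesome. Often the cleanest approach is to let data remain data.
For example: if the concatenated distinct values need more than 64K, INTO is the wrong approach.
Tip: Data-to-code is especially troublesome for continuous values and for other values that are not represented exactly the same once they become code.
For example: high-precision numeric values, strings with control characters, strings with embedded quotes, etc.
Here is one approach that uses SQL. As mentioned, Proc SURVEYSELECT is far better for real applications.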
Proc SQL;
Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
%let REPLACEMENT_COUNT = &SQLOBS; %* Tip: Take advantage of automatic macro variable SQLOBS;
quit;
data REPLACEMENTS;
set REPLACEMENTS;
rownum+1; * rownum needed for RANUNI matching;
run;
Proc SQL;
* Perform replacement of missing values;
Update have
set gender =
(
select gender
from REPLACEMENTS
where rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234))
)
where gender is NULL
;
quit;
%let SYSLAST = have;
DM 'viewtable have' viewtable;
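You don't have to worry about columns that have no missing values, because no replacement happens for them. For columns that do have missing values, the candidate REPLACEMENTS list excludes the missing value, so REPLACEMENT_COUNT is the right count for a uniform replacement probability of 1/COUNT, coded as rownum = ceil(COUNT * random).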
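If the scalar subquery gives trouble (RANUNI in its WHERE clause may be evaluated once per row of REPLACEMENTS, so it is not guaranteed to match exactly one row), a DATA step with SET ... POINT= is one alternative that still keeps the candidate values in a data set. This is a sketch rather than the approach above: the names want, pick and n_rep are made up, and it uses RAND with CALL STREAMINIT instead of RANUNI.

```sas
data have;
  input id gender $ b $ c $ x;
  datalines;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;

proc sql;
  create table REPLACEMENTS as
    select distinct gender from have where gender is not null;
quit;

data want;
  if _n_ = 1 then call streaminit(1234);
  set have;
  if missing(gender) then do;
    pick = ceil(n_rep * rand('uniform'));    /* row number between 1 and n_rep */
    set REPLACEMENTS point=pick nobs=n_rep;  /* overwrite gender with that row */
  end;
run;
```

NOBS= is filled in when the step is compiled, so n_rep is already known when pick is computed, and neither pick nor n_rep is written to want.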